Abstract

We consider a problem of considerable practical interest: the recovery of a data matrix from 
a sampling of its entries. Suppose that we observe m entries selected uniformly at random from 
a matrix M. Can we complete the matrix and recover the entries that we have not seen? 

We show that one can perfectly recover most low-rank matrices from what appears to be an 
incomplete set of entries. We prove that if the number m of sampled entries obeys 

$$ m \ge C\, n^{1.2}\, r \log n $$

for some positive numerical constant $C$, then with very high probability most $n \times n$ matrices
of rank r can be perfectly recovered by solving a simple convex optimization program. This 
program finds the matrix with minimum nuclear norm that fits the data. The condition above 
assumes that the rank is not too large. However, if one replaces the 1.2 exponent with 1.25, 
then the result holds for all values of the rank. Similar results hold for arbitrary rectangular 
matrices as well. Our results are connected with the recent literature on compressed sensing, 
and show that objects other than signals and images can be perfectly reconstructed from very 
limited information. 

Keywords. Matrix completion, low-rank matrices, convex optimization, duality in optimization,
nuclear norm minimization, random matrices, noncommutative Khintchine inequality, decoupling,
compressed sensing.

1 Introduction 

In many practical problems of interest, one would like to recover a matrix from a sampling of its 
entries. As a motivating example, consider the task of inferring answers in a partially filled out 
survey. That is, suppose that questions are being asked to a collection of individuals. Then we 
can form a matrix where the rows index each individual and the columns index the questions. 
We collect data to fill out this table but unfortunately, many questions are left unanswered. Is it 
possible to make an educated guess about what the missing answers should be? How can one make 
such a guess? Formally, we may view this problem as follows. We are interested in recovering a 
data matrix $M$ with $n_1$ rows and $n_2$ columns but only get to observe a number $m$ of its entries
which is comparably much smaller than $n_1 n_2$, the total number of entries. Can one recover the
matrix M from m of its entries? In general, everyone would agree that this is impossible without 
some additional information. 

In many instances, however, the matrix we wish to recover is known to be structured in the
sense that it is low-rank or approximately low-rank. (We recall for completeness that a matrix with
$n_1$ rows and $n_2$ columns has rank $r$ if its rows or columns span an $r$-dimensional space.) Below are
two examples of practical scenarios where one would like to be able to recover a low-rank matrix 
from a sampling of its entries. 

• The Netflix problem. In the area of recommender systems, users submit ratings on a subset 
of entries in a database, and the vendor provides recommendations based on the user's
preferences [28,32]. Because users only rate a few items, one would like to infer their preference
for unrated items. 

A special instance of this problem is the now famous Netflix problem [2]. Users (rows of the
data matrix) are given the opportunity to rate movies (columns of the data matrix) but users 
typically rate only very few movies so that there are very few scattered observed entries of 
this data matrix. Yet one would like to complete this matrix so that the vendor (here Netflix) 
might recommend titles that any particular user is likely to be willing to order. In this case, 
the data matrix of all user-ratings may be approximately low-rank because it is commonly 
believed that only a few factors contribute to an individual's tastes or preferences. 

• Triangulation from incomplete data. Suppose we are given partial information about the
distances between objects and would like to reconstruct the low-dimensional geometry describing
their locations. For example, we may have a network of low-power wirelessly networked
sensors scattered randomly across a region. Suppose each sensor only has the ability to construct
distance estimates based on signal strength readings from its nearest fellow sensors. From 
these noisy distance estimates, we can form a partially observed distance matrix. We can 
then estimate the true distance matrix whose rank will be equal to two if the sensors are 
located in a plane or three if they are located in three dimensional space [24,31]. In this case, 
we only need to observe a few distances per node to have enough information to reconstruct 
the positions of the objects. 

These examples are of course far from exhaustive and there are many other problems which fall in 
this general category. For instance, we may have some very limited information about a covariance 
matrix of interest. Yet, this covariance matrix may be low-rank or approximately low-rank because
the variables only depend upon a comparably smaller number of factors. 

1.1 Impediments and solutions 

Suppose for simplicity that we wish to recover a square $n \times n$ matrix $M$ of rank $r$.¹ Such a
matrix $M$ can be represented by $n^2$ numbers, but it only has $(2n - r)r$ degrees of freedom. This
fact can be revealed by counting parameters in the singular value decomposition (the number of
degrees of freedom associated with the description of the singular values and of the left and right
singular vectors). When the rank is small, this is considerably smaller than $n^2$. For instance, when
$M$ encodes a 10-dimensional phenomenon, then the number of degrees of freedom is about $20n$,
offering a reduction in dimensionality by a factor about equal to $n/20$. When $n$ is large (e.g. in the
thousands or millions), the data matrix carries much less information than its ambient dimension
suggests. The problem is now whether it is possible to recover this matrix from a sampling of its
entries without having to probe all the $n^2$ entries, or more generally collect $n^2$ or more measurements
about $M$.

¹ We emphasize that there is nothing special about $M$ being square and all of our discussion would apply to
arbitrary rectangular matrices as well. The advantage of focusing on square matrices is a simplified exposition and
reduction in the number of parameters of which we need to keep track.
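The parameter count above is easy to check numerically. The following is a minimal sketch; the function name `degrees_of_freedom` and the values of `n` and `r` are illustrative choices, not part of the paper.

```python
def degrees_of_freedom(n: int, r: int) -> int:
    """Free parameters of an n x n matrix of rank r, using the count (2n - r) r."""
    return (2 * n - r) * r


# Illustrative values: a 10-dimensional phenomenon observed in a 1000 x 1000 matrix.
n, r = 1000, 10
dof = degrees_of_freedom(n, r)   # (2*1000 - 10) * 10 = 19900, i.e. about 20 n
reduction = n * n / dof          # close to n / 20 = 50
print(dof, round(reduction, 1))
```

For these values the count is 19900, close to the $20n$ approximation quoted in the text.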



1.1.1 Which matrices? 

In general, one cannot hope to be able to recover a low-rank matrix from a sample of its entries. 
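Consider the rank-1 matrix $M$ equal to

$$ M = e_1 e_n^* = \begin{bmatrix} 0 & 0 & \cdots & 0 & 1 \\ 0 & 0 & \cdots & 0 & 0 \\ \vdots & \vdots & & \vdots & \vdots \\ 0 & 0 & \cdots & 0 & 0 \end{bmatrix}, \tag{1.1} $$

where here and throughout, $e_i$ is the $i$th canonical basis vector in Euclidean space (the vector with
all entries equal to 0 but the $i$th equal to 1). This matrix has a 1 in the top-right corner and all the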
other entries are 0. Clearly this matrix cannot be recovered from a sampling of its entries unless 
we pretty much see all the entries. The reason is that for most sampling sets, we would only get to 
see zeros so that we would have no way of guessing that the matrix is not zero. For instance, if we 
were to see 90% of the entries selected at random, then 10% of the time we would only get to see 
zeroes. 
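The 10% figure can be reproduced with a short simulation. This is a sketch under the assumption that the $m$ observed locations are drawn uniformly without replacement; the helper names `miss_probability` and `simulate_miss_rate` are hypothetical.

```python
import random


def miss_probability(n: int, m: int) -> float:
    """Exact probability that a fixed entry is NOT among m of the n*n entries
    sampled uniformly at random without replacement."""
    return 1 - m / (n * n)


def simulate_miss_rate(n: int, m: int, trials: int, seed: int = 0) -> float:
    """Monte Carlo estimate of the same probability for the top-right entry."""
    rng = random.Random(seed)
    target = n - 1  # row-major flat index of entry (row 0, column n-1)
    misses = 0
    for _ in range(trials):
        sample = rng.sample(range(n * n), m)
        if target not in sample:
            misses += 1
    return misses / trials


# Observing 90% of the entries of a 10 x 10 matrix misses the single 1 with probability 0.1.
print(miss_probability(10, 90), simulate_miss_rate(10, 90, trials=20000))
```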

It is therefore impossible to recover all low-rank matrices from a set of sampled entries but 
can one recover most of them? To investigate this issue, we introduce a simple model of low-rank 
matrices. Consider the singular value decomposition (SVD) of a matrix $M$

$$ M = \sum_{k=1}^{r} \sigma_k u_k v_k^*, \tag{1.2} $$

where the $u_k$'s and $v_k$'s are the left and right singular vectors, and the $\sigma_k$'s are the singular values
(the roots of the eigenvalues of $M^* M$). Then we could think of a generic low-rank matrix as follows:
the family $\{u_k\}_{1 \le k \le r}$ is selected uniformly at random among all families of $r$ orthonormal vectors,
and similarly for the family $\{v_k\}_{1 \le k \le r}$. The two families may or may not be independent of
each other. We make no assumptions about the singular values $\sigma_k$. In the sequel, we will refer to
this model as the random orthogonal model. This model is convenient in the sense that it is both 
very concrete and simple, and useful in the sense that it will help us fix the main ideas. In the 
sequel, however, we will consider far more general models. The question for now is whether or not 
one can recover such a generic matrix from a sampling of its entries. 
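One way to draw a matrix from the random orthogonal model is sketched below. It assumes real-valued matrices and uses the QR factorization of a Gaussian matrix (with a sign correction) to produce a uniformly distributed orthonormal family; this is a standard construction, not one specified in the paper, and the function names are hypothetical.

```python
import numpy as np


def random_orthonormal(n, r, rng):
    """Return an n x r matrix whose columns form a random orthonormal family."""
    G = rng.standard_normal((n, r))
    Q, R = np.linalg.qr(G)          # reduced QR: Q is n x r with orthonormal columns
    # Flip column signs so that diag(R) > 0; this makes the distribution of Q uniform.
    return Q * np.sign(np.diag(R))


def random_orthogonal_model(n, r, singular_values, rng):
    """Build M = sum_k sigma_k u_k v_k^T with independent random singular vectors."""
    U = random_orthonormal(n, r, rng)
    V = random_orthonormal(n, r, rng)
    M = (U * singular_values) @ V.T  # U diag(sigma) V^T
    return U, V, M


rng = np.random.default_rng(0)
U, V, M = random_orthogonal_model(50, 3, np.array([5.0, 2.0, 1.0]), rng)
print(np.linalg.matrix_rank(M))
```

The singular values are passed in by the caller because the model makes no assumption about them. Here the two families are drawn independently, which is one of the cases the model allows.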



1.1.2 Which sampling sets? 

Clearly, one cannot hope to reconstruct any low-rank matrix $M$, even of rank 1, if the sampling
set avoids any column or row of $M$. Suppose that $M$ is of rank 1 and of the form $xy^*$, $x, y \in \mathbb{R}^n$,
so that the $(i, j)$th entry is given by

$$ M_{ij} = x_i y_j. $$

Then if we do not have samples from the first row for example, one could never guess the value of
the first component $x_1$, by any method whatsoever; no information about $x_1$ is observed. There is
of course nothing special about the first row and this argument extends to any row or column. To
have any hope of recovering an unknown matrix, one needs at least one observation per row and
one observation per column.
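The necessary condition just stated (at least one observation in every row and every column) can be tested directly on a sampling set. The sketch below is illustrative only; `covers_every_row_and_column` and `sample_locations` are hypothetical helper names, and locations are 0-indexed pairs.

```python
import random


def covers_every_row_and_column(omega, n1, n2):
    """True if the set of observed locations hits each of the n1 rows and n2 columns."""
    rows = {i for i, _ in omega}
    cols = {j for _, j in omega}
    return len(rows) == n1 and len(cols) == n2


def sample_locations(n1, n2, m, seed=0):
    """Draw m distinct locations (i, j) uniformly at random from an n1 x n2 grid."""
    rng = random.Random(seed)
    flat = rng.sample(range(n1 * n2), m)
    return [divmod(k, n2) for k in flat]  # row-major flat index -> (row, column)


n = 4
diagonal = [(i, i) for i in range(n)]
no_first_row = [(i, j) for i in range(1, n) for j in range(n)]
print(covers_every_row_and_column(diagonal, n, n))      # every row and column is hit
print(covers_every_row_and_column(no_first_row, n, n))  # row 0 is never observed
```

Passing this check is necessary but far from sufficient for recovery.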

We have just seen that if the sampling is adversarial, e.g. one observes all of the entries of $M$
but those in the first row, then one would not even be able to recover matrices of rank 1. But what
happens for most sampling sets? Can one recover a low-rank matrix from almost all sampling sets of
cardinality $m$? Formally, suppose that the set $\Omega$ of locations corresponding to the observed entries
($(i,j) \in \Omega$ if $M_{ij}$ is observed) is a set of cardinality $m$ sampled uniformly at random. Then can
one recover a generic low-rank matrix $M$, perhaps with very large probability, from the knowledge
of the value of its entries in the set $\Omega$?



1.1.3 Which algorithm? 

If the number of measurements is sufficiently large, and if the entries are sufficiently uniformly 
distributed as above, one might hope that there is only one low-rank matrix with these entries. If 
this were true, one would want to recover the data matrix by solving the optimization problem 

$$ \begin{array}{ll} \text{minimize} & \operatorname{rank}(X) \\ \text{subject to} & X_{ij} = M_{ij}, \quad (i,j) \in \Omega, \end{array} \tag{1.3} $$

where $X$ is the decision variable and $\operatorname{rank}(X)$ is equal to the rank of the matrix $X$. The program
(1.3) is a common sense approach which simply seeks the simplest explanation fitting the observed
data. If there were only one low-rank object fitting the data, this would recover $M$. This is
unfortunately of little practical use because this optimization problem is not only NP-hard, but all
known algorithms which provide exact solutions require time doubly exponential in the dimension
$n$ of the matrix in both theory and practice [14].

If a matrix has rank $r$, then it has exactly $r$ nonzero singular values so that the rank function
in (1.3) is simply the number of nonvanishing singular values. In this paper, we consider an
alternative which minimizes the sum of the singular values over the constraint set. This sum is
called the nuclear norm,

$$ \|X\|_* = \sum_{k=1}^{n} \sigma_k(X), \tag{1.4} $$

where, here and below, $\sigma_k(X)$ denotes the $k$th largest singular value of $X$. The heuristic optimization
is then given by

$$ \begin{array}{ll} \text{minimize} & \|X\|_* \\ \text{subject to} & X_{ij} = M_{ij}, \quad (i,j) \in \Omega. \end{array} \tag{1.5} $$

Whereas the rank function counts the number of nonvanishing singular values, the nuclear norm
sums their amplitude and in some sense, is to the rank functional what the convex $\ell_1$ norm is to
the counting $\ell_0$ norm in the area of sparse signal recovery. The main point here is that the nuclear
norm is a convex function and, as we will discuss in Section 1.4, can be optimized efficiently via
semidefinite programming.
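The relation between the rank and the nuclear norm can be made concrete by computing both from the singular values. This is a minimal sketch; the function name `rank_and_nuclear_norm` and the tolerance `tol` used to decide which singular values count as nonzero are assumptions made for illustration.

```python
import numpy as np


def rank_and_nuclear_norm(X, tol=1e-10):
    """Return (number of singular values above tol, sum of all singular values)."""
    s = np.linalg.svd(X, compute_uv=False)
    return int(np.sum(s > tol)), float(np.sum(s))


# A rank-1 example: the outer product of (1, 2) and (3, 4).
X = np.outer([1.0, 2.0], [3.0, 4.0])
rank, nuc = rank_and_nuclear_norm(X)
# Its only nonzero singular value is ||(1, 2)|| * ||(3, 4)|| = 5 * sqrt(5).
print(rank, nuc, np.linalg.norm(X, 'nuc'))
```

The rank is the $\ell_0$ "norm" of the vector of singular values and the nuclear norm is its $\ell_1$ norm, which is the analogy drawn in the text.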



1.1.4 A first typical result 

Our first result shows that, perhaps unexpectedly, this heuristic optimization recovers a generic M 
when the number of randomly sampled entries is large enough. We will prove the following: 

Theorem 1.1 Let $M$ be an $n_1 \times n_2$ matrix of rank $r$ sampled from the random orthogonal model,
and put $n = \max(n_1, n_2)$. Suppose we observe $m$ entries of $M$ with locations sampled uniformly at
random. Then there are numerical constants $C$ and $c$ such that if

$$ m \ge C\, n^{5/4}\, r \log n, \tag{1.6} $$

the minimizer to the problem (1.5) is unique and equal to $M$ with probability at least $1 - c n^{-3}$; that
is to say, the semidefinite program (1.5) recovers all the entries of $M$ with no error. In addition,
if $r \le n^{1/5}$, then the recovery is exact with probability at least $1 - c n^{-3}$ provided that

$$ m \ge C\, n^{6/5}\, r \log n. \tag{1.7} $$

The theorem states that a surprisingly small number of entries are sufficient to complete a generic
low-rank matrix. For small values of the rank, e.g. when $r = O(1)$ or $r = O(\log n)$, one only needs
to see on the order of $n^{6/5}$ entries (ignoring logarithmic factors) which is considerably smaller than
$n^2$, the total number of entries of a square matrix. The real feat, however, is that the recovery
algorithm is tractable and very concrete. Hence the contribution is twofold:

• Under the hypotheses of Theorem 1.1, there is a unique low-rank matrix which is consistent
with the observed entries.

• Further, this matrix can be recovered by the convex optimization (1.5). In other words, for
most problems, the nuclear norm relaxation is formally equivalent to the combinatorially hard
rank minimization problem (1.3).

Theorem 1.1 is in fact a special instance of a far more general theorem that covers a much larger
set of matrices $M$. We describe this general class of matrices and precise recovery conditions in
the next section.

1.2 Main results 



As seen in our first example (1.1), it is impossible to recover a matrix which is equal to zero in 
nearly all of its entries unless we see all the entries of the matrix. To recover a low-rank matrix, 
this matrix cannot be in the null space of the sampling operator giving the values of a subset of the 
entries. Now it is easy to see that if the singular vectors of a matrix M are highly concentrated, 
then M could very well be in the null-space of the sampling operator. For instance consider the 
rank-2 symmetric matrix $M$ given by

$$ M = \sum_{k=1}^{2} \sigma_k u_k u_k^*, \qquad u_1 = (e_1 + e_2)/\sqrt{2}, \quad u_2 = (e_1 - e_2)/\sqrt{2}, $$

where the singular values are arbitrary. Then this matrix vanishes everywhere except in the top-left
$2 \times 2$ corner and one would basically need to see all the entries of $M$ to be able to recover this
matrix exactly by any method whatsoever. There is an endless list of examples of this sort. Hence,
we arrive at the notion that, somehow, the singular vectors need to be sufficiently spread (that is,
uncorrelated with the standard basis) in order to minimize the number of observations needed to
recover a low-rank matrix.² This motivates the following definition.

² Both the left and right singular vectors need to be uncorrelated with the standard basis. Indeed, the matrix $e_1 v^*$
has its first row equal to $v^*$ and all the others equal to zero. Clearly, this rank-1 matrix cannot be recovered unless
we basically see all of its entries.

Definition 1.2 Let $U$ be a subspace of $\mathbb{R}^n$ of dimension $r$ and $P_U$ be the orthogonal projection
onto $U$. Then the coherence of $U$ (vis-à-vis the standard basis $(e_i)$) is defined to be

$$ \mu(U) \equiv \frac{n}{r} \max_{1 \le i \le n} \|P_U e_i\|^2. \tag{1.8} $$

Note that for any subspace, the smallest $\mu(U)$ can be is 1, achieved, for example, if $U$ is spanned
by vectors whose entries all have magnitude $1/\sqrt{n}$. The largest possible value for $\mu(U)$ is $n/r$
which would correspond to any subspace that contains a standard basis element. We shall be
primarily interested in subspaces with low coherence as matrices whose column and row spaces have
low coherence cannot really be in the null space of the sampling operator. For instance, we will see
that the random subspaces discussed above have nearly minimal coherence.
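The coherence (1.8) is straightforward to evaluate when the subspace is given by an orthonormal basis: if the columns of a matrix $U$ form such a basis, then $\|P_U e_i\|^2$ is the squared norm of the $i$th row of $U$. The sketch below relies on that assumption (orthonormal columns); the function name `coherence` is hypothetical.

```python
import numpy as np


def coherence(U):
    """Coherence (n / r) * max_i ||P_U e_i||^2 of the column span of U.

    Assumes the columns of U are orthonormal, so ||P_U e_i||^2 is the
    squared Euclidean norm of the i-th row of U.
    """
    n, r = U.shape
    row_norms_sq = np.sum(np.abs(U) ** 2, axis=1)
    return n / r * float(np.max(row_norms_sq))


n = 4
flat = np.ones((n, 1)) / np.sqrt(n)   # entries of magnitude 1/sqrt(n): minimal coherence
spike = np.zeros((n, 1))
spike[0, 0] = 1.0                      # contains e_1: maximal coherence n / r
print(coherence(flat), coherence(spike))
```

The two test subspaces realize the extreme values 1 and $n/r$ mentioned after the definition.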

To state our main result, we introduce two assumptions about an $n_1 \times n_2$ matrix $M$ whose
SVD is given by $M = \sum_{1 \le k \le r} \sigma_k u_k v_k^*$ and with column and row spaces denoted by $U$ and $V$
respectively.

A0 The coherences obey $\max(\mu(U), \mu(V)) \le \mu_0$ for some positive $\mu_0$.

A1 The $n_1 \times n_2$ matrix $\sum_{1 \le k \le r} u_k v_k^*$ has a maximum entry bounded by $\mu_1 \sqrt{r/(n_1 n_2)}$ in absolute
value for some positive $\mu_1$.

The $\mu$'s above may depend on $r$ and $n_1, n_2$. Moreover, note that A1 always holds with $\mu_1 = \mu_0 \sqrt{r}$
since the $(i,j)$th entry of the matrix $\sum_{1 \le k \le r} u_k v_k^*$ is given by $\sum_{1 \le k \le r} u_{ik} v_{jk}$ and by the
Cauchy-Schwarz inequality,

$$ \Bigl| \sum_{1 \le k \le r} u_{ik} v_{jk} \Bigr| \le \sqrt{\sum_{1 \le k \le r} |u_{ik}|^2} \, \sqrt{\sum_{1 \le k \le r} |v_{jk}|^2} \le \frac{\mu_0 r}{\sqrt{n_1 n_2}}. $$

Hence, for sufficiently small ranks, $\mu_1$ is comparable to $\mu_0$. As we will see in Section 2, for larger
ranks, both subspaces selected from the uniform distribution and spaces constructed as the span
of singular vectors with bounded entries are not only incoherent with the standard basis, but also
obey A1 with high probability for values of $\mu_1$ at most logarithmic in $n_1$ and/or $n_2$. Below we will
assume that $\mu_1$ is greater than or equal to 1.
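The Cauchy-Schwarz bound above, which shows that A1 always holds with $\mu_1 = \mu_0 \sqrt{r}$, can be checked numerically on random subspaces. This is a sketch for real matrices; the helper names and the dimensions (60, 40, rank 4) are arbitrary illustrative choices.

```python
import numpy as np


def orthonormal_basis(n, r, rng):
    """n x r matrix with orthonormal columns spanning a random subspace."""
    Q, _ = np.linalg.qr(rng.standard_normal((n, r)))
    return Q


def coherence(U):
    """(n / r) * max_i ||P_U e_i||^2, assuming orthonormal columns."""
    n, r = U.shape
    return n / r * float(np.max(np.sum(U ** 2, axis=1)))


def smallest_mu1(U, V):
    """Smallest mu_1 for which A1 holds: max |(U V^T)_ij| * sqrt(n1 n2 / r)."""
    n1, r = U.shape
    n2 = V.shape[0]
    E = U @ V.T                       # the matrix sum_k u_k v_k^T
    return float(np.max(np.abs(E)) * np.sqrt(n1 * n2 / r))


rng = np.random.default_rng(0)
r = 4
U = orthonormal_basis(60, r, rng)
V = orthonormal_basis(40, r, rng)
mu0 = max(coherence(U), coherence(V))
mu1 = smallest_mu1(U, V)
print(mu1 <= mu0 * np.sqrt(r))  # guaranteed by the Cauchy-Schwarz argument
```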

We are in the position to state our main result: if a matrix has row and column spaces that are 
incoherent with the standard basis, then nuclear norm minimization can recover this matrix from 
a random sampling of a small number of entries. 

Theorem 1.3 Let $M$ be an $n_1 \times n_2$ matrix of rank $r$ obeying A0 and A1 and put $n = \max(n_1, n_2)$.
Suppose we observe $m$ entries of $M$ with locations sampled uniformly at random. Then there exist
constants $C$, $c$ such that if

$$ m \ge C \max\bigl(\mu_1^2,\ \mu_0^{1/2} \mu_1,\ \mu_0 n^{1/4}\bigr)\, n r (\beta \log n) \tag{1.9} $$

for some $\beta > 2$, then the minimizer to the problem (1.5) is unique and equal to $M$ with probability
at least $1 - c n^{-\beta}$. For $r \le \mu_0^{-1} n^{1/5}$ this estimate can be improved to

$$ m \ge C \mu_0\, n^{6/5}\, r (\beta \log n) \tag{1.10} $$

with the same probability of success. 

Theorem 1.3 asserts that if the coherence is low, few samples are required to recover $M$. For
example, if $\mu_0 = O(1)$ and the rank is not too large, then the recovery is exact with large probability
provided that

$$ m \ge C\, n^{6/5}\, r \log n. \tag{1.11} $$

We give two illustrative examples of matrices with incoherent column and row spaces. This list is
by no means exhaustive.

1. The first example is the random orthogonal model. For values of the rank $r$ greater than
$\log n$, $\mu(U)$ and $\mu(V)$ are $O(1)$, $\mu_1 = O(\log n)$ both with very large probability. Hence, the
recovery is exact provided that $m$ obeys (1.6) or (1.7). Specializing Theorem 1.3 to these
values of the parameters gives Theorem 1.1. Hence, Theorem 1.1 is a special case of our
general recovery result.

2. The second example is more general and, in a nutshell, simply requires that the components
of the singular vectors of $M$ are small. Assume that the $u_j$ and $v_j$'s obey

$$ \max_{ij} |\langle e_i, u_j \rangle|^2 \le \mu_B / n, \qquad \max_{ij} |\langle e_i, v_j \rangle|^2 \le \mu_B / n, \tag{1.12} $$

for some value of $\mu_B = O(1)$. Then the maximum coherence is at most $\mu_B$ since $\mu(U) \le \mu_B$
and $\mu(V) \le \mu_B$. Further, we will see in Section 2 that A1 holds most of the time with
$\mu_1 = O(\sqrt{\log n})$. Thus, for matrices with singular vectors obeying (1.12), the recovery is
exact provided that $m$ obeys (1.11) for values of the rank not exceeding $\mu_B^{-1} n^{1/5}$.
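As a worked illustration of the first example, the specialization of Theorem 1.3 to the random orthogonal model can be sketched as follows. The parameter values $\mu_0 = O(1)$ and $\mu_1 = O(\log n)$ are those quoted above; the constants $C'$ and $C''$ are placeholders introduced here, and the choice $\beta = 3$ is an assumption made only to match a success probability of the form $1 - c n^{-3}$.

```latex
% Assumed inputs (random orthogonal model): \mu_0 = O(1), \mu_1 = O(\log n).
\max\bigl(\mu_1^2,\ \mu_0^{1/2}\mu_1,\ \mu_0 n^{1/4}\bigr)
  = \max\bigl(O(\log^2 n),\ O(\log n),\ O(n^{1/4})\bigr)
  = O(n^{1/4}),
% because \log^2 n grows more slowly than n^{1/4}. Condition (1.9) therefore holds whenever
m \ \ge\ C'\, n^{1/4} \cdot n\, r\, (\beta \log n) \;=\; C' \beta\, n^{5/4}\, r \log n ,
% which has the form of (1.6). Similarly, for r \le \mu_0^{-1} n^{1/5}, condition (1.10) becomes
m \ \ge\ C''\, n^{6/5}\, r \log n ,
% which has the form of (1.7). Taking \beta = 3 gives success probability 1 - c\,n^{-3}.
```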

1.3 Extensions 



Our main result (Theorem 1.3) extends to a variety of other low-rank matrix completion problems 
beyond the sampling of entries. Indeed, suppose we have two orthonormal bases $f_1, \ldots, f_n$ and
$g_1, \ldots, g_n$ of $\mathbb{R}^n$, and that we are interested in solving the rank minimization problem

$$ \begin{array}{ll} \text{minimize} & \operatorname{rank}(X) \\ \text{subject to} & f_i^* X g_j = f_i^* M g_j, \quad (i,j) \in \Omega. \end{array} \tag{1.13} $$



This comes up in a number of applications. As a motivating example, there has been a great deal of 
interest in the machine learning community in developing specialized algorithms for the multiclass 
and multitask learning problems (see, e.g., [1,3,5]). In multiclass learning, the goal is to build 
multiple classifiers with the same training data to distinguish between more than two categories. 
For example, in face recognition, one might want to classify whether an image patch corresponds 
to an eye, nose, or mouth. In multitask learning, we have a large set of data, but have a variety of 
different classification tasks, and, for each task, only partial subsets of the data are relevant. For 
instance, in activity recognition, we may have acquired sets of observations of multiple subjects 
and want to determine if each observed person is walking or running. However, a different classifier 
is to be learned for each individual, and it is not clear how having access to the full collection 
of observations can improve classification performance. Multitask learning aims precisely to take 
advantage of the access to the full database to improve performance on the individual tasks. 

In the abstract formulation of this problem for linear classifiers, we have $K$ classes to distinguish
and are given training examples $f_1, \ldots, f_n$. For each example, we are given partial labeling
information about which classes it belongs or does not belong to. That is, for each example $f_i$
and class $k$, we may either be told that $f_i$ belongs to class $k$, be told $f_i$ does not belong to class
$k$, or provided no information about the membership of $f_i$ to class $k$. For each class $1 \le k \le K$,
we would like to produce a linear function $w_k$ such that $w_k^* f_i > 0$ if $f_i$ belongs to class $k$ and
$w_k^* f_i < 0$ otherwise. Formally, we can search for the vector $w_k$ that satisfies the equality
constraints $w_k^* f_i = y_{ik}$ where $y_{ik} = 1$ if we are told that $f_i$ belongs to class $k$, $y_{ik} = -1$ if we are
told that $f_i$ does not belong to class $k$, and $y_{ik}$ is unconstrained if we are not provided information.
A common hypothesis in the multitask setting is that the $w_k$ corresponding to each of the classes
together span a very low dimensional subspace with dimension significantly smaller than $K$ [1,3,5].
That is, the basic assumption is that

$$ W = [w_1, \ldots, w_K] $$

is low-rank. Hence, the multiclass learning problem can be cast as (1.13) with observations of the
form $f_i^* W e_j$.



To see that our theorem provides conditions under which (1.13) can be solved via nuclear norm
minimization, note that there exist unitary transformations $F$ and $G$ such that $e_j = F f_j$ and
$e_j = G g_j$ for each $j = 1, \ldots, n$. Hence,

$$ f_i^* X g_j = e_i^* (F X G^*) e_j. $$

Then if the conditions of Theorem 1.3 hold for the matrix $F X G^*$, it is immediate that nuclear
norm minimization finds the unique optimal solution of (1.13) when we are provided a large enough
random collection of the inner products $f_i^* M g_j$. In other words, all that is needed is that the
column and row spaces of $M$ be respectively incoherent with the basis $(f_i)$ and $(g_i)$.
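The change of basis used in this argument can be verified numerically. The sketch below works with real matrices, so that the adjoint is the transpose, and draws two random orthonormal bases; the variable names `F_cols` and `G_cols` (matrices whose columns are the $f_j$ and $g_j$) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5

# Columns of F_cols and G_cols are the orthonormal bases f_1..f_n and g_1..g_n.
F_cols, _ = np.linalg.qr(rng.standard_normal((n, n)))
G_cols, _ = np.linalg.qr(rng.standard_normal((n, n)))

# F maps f_j to e_j (and G maps g_j to e_j): for an orthonormal basis this is the transpose.
F = F_cols.T
G = G_cols.T

X = rng.standard_normal((n, n))

# Left-hand side: all the inner products f_i^T X g_j, computed one by one.
lhs = np.array([[F_cols[:, i] @ X @ G_cols[:, j] for j in range(n)]
                for i in range(n)])
# Right-hand side: the entries e_i^T (F X G^T) e_j of the transformed matrix.
rhs = F @ X @ G.T

print(np.allclose(lhs, rhs))
```

Observing inner products $f_i^* X g_j$ is thus the same as observing entries of $F X G^*$, which is why the entrywise theory applies.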

From this perspective, we additionally remark that our results likely extend to the case where 
one observes a small number of arbitrary linear functionals of a hidden matrix $M$. Set $N = n^2$
and let $A_1, \ldots, A_N$ be an orthonormal basis for the linear space of $n \times n$ matrices with the usual
inner product $\langle X, Y \rangle = \operatorname{trace}(X^* Y)$. Then we expect our results should also apply to the rank
minimization problem

$$ \begin{array}{ll} \text{minimize} & \operatorname{rank}(X) \\ \text{subject to} & \langle A_k, X \rangle = \langle A_k, M \rangle, \quad k \in \Omega, \end{array} \tag{1.14} $$

where $\Omega \subset \{1, \ldots, N\}$ is selected uniformly at random. In fact, (1.14) is (1.3) when the orthobasis
is the canonical basis $(e_i e_j^*)_{1 \le i,j \le n}$. Here, those low-rank matrices which have small inner product
with all the basis elements $A_k$ may be recoverable by nuclear norm minimization. To avoid unnecessary
confusion and notational clutter, we leave this general low-rank recovery problem for future
work.



1.4 Connections, alternatives and prior art 

Nuclear norm minimization is a recent heuristic introduced by Fazel in [18] , and is an extension of 
the trace heuristic often used by the control community, see e.g. [6,26]. Indeed, when the matrix
variable is symmetric and positive semidefinite, the nuclear norm of $X$ is the sum of the (nonnegative)
eigenvalues and thus equal to the trace of $X$. Hence, for positive semidefinite unknowns,
(1.5) would simply minimize the trace over the constraint set:

$$ \begin{array}{ll} \text{minimize} & \operatorname{trace}(X) \\ \text{subject to} & X_{ij} = M_{ij}, \quad (i,j) \in \Omega, \\ & X \succeq 0. \end{array} $$

This is a semidefinite program. Even for the general matrix $M$ which may not be positive definite or
even symmetric, the nuclear norm heuristic can be formulated in terms of semidefinite programming
as, for instance, the program (1.5) is equivalent to

$$ \begin{array}{ll} \text{minimize} & \operatorname{trace}(W_1) + \operatorname{trace}(W_2) \\ \text{subject to} & X_{ij} = M_{ij}, \quad (i,j) \in \Omega, \\ & \begin{bmatrix} W_1 & X \\ X^* & W_2 \end{bmatrix} \succeq 0, \end{array} $$

with optimization variables $X$, $W_1$ and $W_2$ (see, e.g., [18,35]). There are many efficient algorithms
and high-quality software available for solving these types of problems.
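To see why the trace objective is tied to the nuclear norm, one can exhibit, for any fixed real matrix $X$ with SVD $X = U \Sigma V^T$, a feasible pair $W_1 = U \Sigma U^T$, $W_2 = V \Sigma V^T$ whose objective value is exactly $2\|X\|_*$. The sketch below only verifies this feasible point numerically; it does not solve the semidefinite program, and the function name `trace_certificate` is hypothetical.

```python
import numpy as np


def trace_certificate(X):
    """Build W1 = U S U^T and W2 = V S V^T from the SVD X = U S V^T, and the
    block matrix [[W1, X], [X^T, W2]] appearing in the semidefinite constraint."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    W1 = (U * s) @ U.T        # U diag(s) U^T
    W2 = (Vt.T * s) @ Vt      # V diag(s) V^T
    block = np.block([[W1, X], [X.T, W2]])
    return W1, W2, block, float(np.sum(s))


rng = np.random.default_rng(3)
X = rng.standard_normal((4, 6))
W1, W2, block, nuclear = trace_certificate(X)

min_eig = np.linalg.eigvalsh(block).min()   # nonnegative up to rounding: block is PSD
objective = np.trace(W1) + np.trace(W2)     # equals 2 * ||X||_*
print(min_eig >= -1e-9, np.isclose(objective, 2 * nuclear))
```

The block matrix equals $[U; V]\,\Sigma\,[U; V]^T$, which is positive semidefinite by construction. Since the objective differs from $\|X\|_*$ only by the constant factor 2, the minimizer in $X$ is unaffected.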

Our work is inspired by results in the emerging field of compressive sampling or compressed 
sensing, a new paradigm for acquiring information about objects of interest from what appears to 
be a highly incomplete set of measurements [11, 13, 17]. In practice, this means for example that
high-resolution imaging is possible with fewer sensors, or that one can speed up signal acquisition
time in biomedical applications by orders of magnitude, simply by taking far fewer specially coded
samples. Mathematically speaking, we wish to reconstruct a signal $x \in \mathbb{R}^n$ from a small number
of measurements $y = \Phi x$, $y \in \mathbb{R}^m$, where $m$ is much smaller than $n$; i.e. we have far fewer equations
than unknowns. In general, one cannot hope to reconstruct $x$ but assume now that the object we
wish to recover is known to be structured in the sense that it is sparse (or approximately sparse).
This means that the unknown object depends upon a smaller number of unknown parameters.
Then it has been shown that $\ell_1$ minimization allows recovery of sparse signals from remarkably few
measurements: supposing $\Phi$ is chosen randomly from a suitable distribution, then with very high
probability, all sparse signals with about $k$ nonzero entries can be recovered from on the order of
$k \log n$ measurements. For instance, if $x$ is $k$-sparse in the Fourier domain, i.e. $x$ is a superposition
of $k$ sinusoids, then it can be perfectly recovered with high probability, by $\ell_1$ minimization, from
the knowledge of about $k \log n$ of its entries sampled uniformly at random [11].

From this viewpoint, the results in this paper greatly extend the theory of compressed sensing 
by showing that other types of interesting objects or structures, beyond sparse signals and images, 
can be recovered from a limited set of measurements. Moreover, the techniques for proving our 
main results build upon ideas from the compressed sensing literature together with probabilistic 
tools such as the powerful techniques of Bourgain and of Rudelson for bounding norms of operators 
between Banach spaces. 

Our notion of incoherence generalizes the concept of the same name in compressive sampling. 
Notably, in [10], the authors introduce the notion of the incoherence of a unitary transformation. 
Letting $U$ be an $n \times n$ unitary matrix, the coherence of $U$ is given by

$$ \mu(U) = n \max_{j,k} |U_{jk}|^2. $$

This quantity ranges in values from 1 for a unitary transformation whose entries all have the same
magnitude to $n$ for the identity matrix. Using this notion, [10] showed that with high probability,
a $k$-sparse signal could be recovered via linear programming from the observation of the inner
product of the signal with $m = \Omega(\mu(U) k \log n)$ randomly selected columns of the matrix $U$.
This result provided a generalization of the celebrated results about partial Fourier observations
described in [11], a special case where $\mu(U) = 1$. This paper generalizes the notion of incoherence
to problems beyond the setting of sparse signal recovery. 

In [27], the authors studied the nuclear norm heuristic applied to a related problem where
partial information about a matrix $M$ is available from $m$ equations of the form

$$ \langle A^{(k)}, M \rangle = \sum_{ij} A^{(k)}_{ij} M_{ij} = b_k, \quad k = 1, \ldots, m, \tag{1.15} $$

where for each $k$, $\{A^{(k)}_{ij}\}_{ij}$ is an i.i.d. sequence of Gaussian or Bernoulli random variables and
the sequences $\{A^{(k)}\}$ are also independent from each other (the sequences $\{A^{(k)}\}$ and $\{b_k\}$ are
available to the analyst). Building on the concept of restricted isometry introduced in [12] in
the context of sparse signal recovery, [27] establishes the first sufficient conditions for which the
nuclear norm heuristic returns the minimum rank element in the constraint set. They prove that
the heuristic succeeds with large probability whenever the number $m$ of available measurements is
greater than a constant times $2nr \log n$ for $n \times n$ matrices. Although this is an interesting result, a
serious impediment to this approach is that one needs to essentially measure random projections of
the unknown data matrix, a situation which unfortunately does not commonly arise in practice.



Further, the measurements in (1.15) give some information about all the entries of M whereas 
in our problem, information about most of the entries is simply not available. In particular, the 
results and techniques introduced in [27] do not begin to address the matrix completion problem of 
interest to us in this paper. As a consequence, our methods are completely different; for example, 
they do not rely on any notions of restricted isometry. Instead, as we discuss below, we prove
the existence of a Lagrange multiplier for the optimization (1.5) that certifies the unique optimal
solution is precisely the matrix that we wish to recover. 

Finally, we would like to briefly discuss the possibility of other recovery algorithms when the 
sampling happens to be chosen in a very special fashion. For example, suppose that M is generic 
and that we precisely observe every entry in the first $r$ rows and columns of the matrix. Write $M$
in block form as

$$ M = \begin{bmatrix} M_{11} & M_{12} \\ M_{21} & M_{22} \end{bmatrix} $$

with $M_{11}$ an $r \times r$ matrix. In the special case that $M_{11}$ is invertible and $M$ has rank $r$, then it is
easy to verify that $M_{22} = M_{21} M_{11}^{-1} M_{12}$. One can prove this identity by forming the SVD of $M$,
for example. That is, if $M$ is generic, and the upper $r \times r$ block is invertible, and we observe every
entry in the first $r$ rows and columns, we can recover $M$. This result immediately generalizes to the
case where one observes precisely $r$ rows and $r$ columns and the $r \times r$ matrix at the intersection of
the observed rows and columns is invertible. However, this scheme has many practical drawbacks
that stand in the way of a generalization to a completion algorithm from a general set of entries.
First, if we miss any entry in these rows or columns, we cannot recover $M$, nor can we leverage
any information provided by entries of $M_{22}$. Second, if the matrix has rank less than $r$, and we
observe $r$ rows and columns, a combinatorial search to find the collection that has an invertible
square sub-block is required. Moreover, because of the matrix inversion, the algorithm is rather
fragile to noise in the entries.
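The block identity $M_{22} = M_{21} M_{11}^{-1} M_{12}$ can be illustrated on a random rank-$r$ matrix. This is a sketch of the special sampling scheme described above, not of nuclear norm minimization; the function name and the test dimensions are hypothetical, and a linear solve is used in place of an explicit inverse.

```python
import numpy as np


def complete_from_rows_and_columns(M11, M12, M21):
    """Recover the unobserved block M22 = M21 M11^{-1} M12, assuming M has
    rank r and the observed r x r block M11 is invertible."""
    return M21 @ np.linalg.solve(M11, M12)


rng = np.random.default_rng(2)
n, r = 8, 3
# A generic rank-r matrix, built as a product of n x r and r x n Gaussian factors.
M = rng.standard_normal((n, r)) @ rng.standard_normal((r, n))

# Only the first r rows and first r columns are treated as observed.
M22_hat = complete_from_rows_and_columns(M[:r, :r], M[:r, r:], M[r:, :r])
print(np.allclose(M22_hat, M[r:, r:], atol=1e-6))
```

The solve step is where the fragility noted in the text enters: a poorly conditioned $M_{11}$ amplifies any noise in the observed entries.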

1.5 Notations and organization of the paper 

The paper is organized as follows. We first argue in Section 2 that the random orthogonal model
and, more generally, matrices with incoherent column and row spaces obey the assumptions of the
general Theorem 1.3. To prove Theorem 1.3, we first establish sufficient conditions which guarantee
that the true low-rank matrix $M$ is the unique solution to (1.5) in Section 3. One of these conditions
is the existence of a dual vector obeying two crucial properties. Section 4 constructs such a dual
vector and provides the overall architecture of the proof which shows that, indeed, this vector obeys
the desired properties provided that the number of measurements is sufficiently large. Surprisingly,
as explored in Section 5, the existence of a dual vector certifying that $M$ is unique is related to
some problems in random graph theory including "the coupon collector's problem." Following this
discussion, we prove our main result via several intermediate results which are all proven in Section
6. Section 7 introduces numerical experiments showing that matrix completion based on nuclear
norm minimization works well in practice. Section 8 closes the paper with a short summary of
our findings and a discussion of important extensions and improvements. In particular, we will discuss
possible ways of improving the 1.2 exponent in (1.10) so that it gets closer to 1. Finally, the
Appendix provides proofs of auxiliary lemmas supporting our main argument.

Before continuing, we provide here a brief summary of the notations used throughout the
paper. Matrices are bold capital, vectors are bold lowercase and scalars or entries are not bold. For
instance, $X$ is a matrix and $X_{ij}$ its $(i, j)$th entry. Likewise $x$ is a vector and $x_i$ its $i$th component.
When we have a collection of vectors $u_k \in \mathbb{R}^n$ for $1 \le k \le d$, we will denote by $u_{ik}$ the $i$th
component of the vector $u_k$ and $[u_1, \ldots, u_d]$ will denote the $n \times d$ matrix whose $k$th column is $u_k$.

A variety of norms on matrices will be discussed. The spectral norm of a matrix is denoted
by $\|X\|$. The Euclidean inner product between two matrices is $\langle X, Y\rangle = \operatorname{trace}(X^* Y)$, and the
corresponding Euclidean norm, called the Frobenius or Hilbert-Schmidt norm, is denoted $\|X\|_F$.
That is, $\|X\|_F = \langle X, X\rangle^{1/2}$. The nuclear norm of a matrix $X$ is $\|X\|_*$. The maximum entry of
$X$ (in absolute value) is denoted by $\|X\|_\infty \equiv \max_{ij} |X_{ij}|$. For vectors, we will only consider the
usual Euclidean $\ell_2$ norm which we simply write as $\|x\|$.
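For readers who want to experiment, the norms just listed can be computed from the singular values. This is a small sketch with NumPy on an arbitrary random matrix; the variable names are ours.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((5, 4))
Y = rng.standard_normal((5, 4))

sigma = np.linalg.svd(X, compute_uv=False)  # singular values, in descending order

spectral = sigma[0]                    # ||X||   : largest singular value
frobenius = np.sqrt(np.sum(sigma**2))  # ||X||_F : Euclidean norm of the singular values
nuclear = np.sum(sigma)                # ||X||_* : sum of the singular values

# ||X||_inf in the sense used here is the largest entry in absolute value.
# Note that np.linalg.norm(X, np.inf) is a different quantity (maximum row sum).
max_entry = np.abs(X).max()

inner = np.trace(X.T @ Y)              # <X, Y> = trace(X^* Y)
```

The chain `max_entry <= spectral <= frobenius <= nuclear` holds for every matrix and is a convenient sanity check.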

Further, we will also manipulate linear transformations which act on matrices and will use
calligraphic letters for these operators as in $\mathcal{A}(X)$. In particular, the identity operator will be
denoted by $\mathcal{I}$. The only norm we will consider for these operators is their spectral norm (the top
singular value) denoted by $\|\mathcal{A}\| = \sup_{X : \|X\|_F \le 1} \|\mathcal{A}(X)\|_F$.

Finally, we adopt the convention that C denotes a numerical constant independent of the matrix 
dimensions, rank, and number of measurements, whose value may change from line to line. Certain 
special constants with precise numerical values will be ornamented with subscripts (e.g., $C_R$). Any
exceptions to this notational scheme will be noted in the text. 

2 Which matrices are incoherent? 

In this section we restrict our attention to square $n \times n$ matrices, but the extension to rectangular
$n_1 \times n_2$ matrices immediately follows by setting $n = \max(n_1, n_2)$.

2.1 Incoherent bases span incoherent subspaces 

Almost all $n \times n$ matrices $M$ with singular vectors $\{u_k\}_{1 \le k \le r}$ and $\{v_k\}_{1 \le k \le r}$ obeying the size
property (1.12) also satisfy the assumptions A0 and A1 with $\mu_0 = \mu_B$, $\mu_1 = C \mu_B \sqrt{\log n}$ for some
positive constant $C$. As mentioned above, A0 holds automatically, but, observe that A1 would not
hold with a small value of $\mu_1$ if two rows of the matrices $[u_1, \ldots, u_r]$ and $[v_1, \ldots, v_r]$ are identical
with all entries of magnitude $\sqrt{\mu_B/n}$ since it is not hard to see that in this case

$$\Big\|\sum_k u_k v_k^*\Big\|_\infty = \mu_B r/n.$$

Certainly, this example is constructed in a very special way, and should occur infrequently. We
now show that it is generically unlikely.
Consider the matrix

$$\sum_{k=1}^{r} \epsilon_k u_k v_k^*, \tag{2.1}$$

where $\{\epsilon_k\}_{1 \le k \le r}$ is an arbitrary sign sequence. For almost all choices of sign sequences, A1 is
satisfied with $\mu_1 = O(\mu_B \sqrt{\log n})$. Indeed, if one selects the signs uniformly at random, then for
each $\beta > 0$,

$$\mathbb{P}\Big(\Big\|\sum_{k=1}^{r} \epsilon_k u_k v_k^*\Big\|_\infty \ge \mu_B \sqrt{8 \beta r \log n}/n\Big) \le (2n^2)\, n^{-\beta}. \tag{2.2}$$

This is of interest because suppose the low-rank matrix we wish to recover is of the form

$$M = \sum_{k=1}^{r} \lambda_k u_k v_k^* \tag{2.3}$$

with scalars $\lambda_k$. Since the vectors $\{u_k\}$ and $\{v_k\}$ are orthogonal, the singular values of $M$ are
given by $|\lambda_k|$ and the singular vectors are given by $\operatorname{sgn}(\lambda_k) u_k$ and $v_k$ for $k = 1, \ldots, r$. Hence, in
this model A1 concerns the maximum entry of the matrix given by (2.1) with $\epsilon_k = \operatorname{sgn}(\lambda_k)$. That
is to say, for most sign patterns, the matrix of interest obeys an appropriate size condition. We
emphasize here that the only thing that we assumed about the $u_k$'s and $v_k$'s was that they had
small entries. In particular, they could be equal to each other as would be the case for a symmetric
matrix.

The claim (2.2) is a simple application of Hoeffding's inequality. The $(i, j)$th entry of (2.1) is
given by

$$Z_{ij} = \sum_{1 \le k \le r} \epsilon_k u_{ik} v_{jk},$$

and is a sum of $r$ zero-mean independent random variables, each bounded by $\mu_B/n$. Therefore,

$$\mathbb{P}\big(|Z_{ij}| \ge \lambda \mu_B \sqrt{r}/n\big) \le 2 e^{-\lambda^2/8}.$$

Setting $\lambda$ proportional to $\sqrt{\log n}$ and applying the union bound gives the claim.

To summarize, we say that $M$ is sampled from the incoherent basis model if it is of the form

$$M = \sum_{k=1}^{r} \epsilon_k \sigma_k u_k v_k^*; \tag{2.4}$$

$\{\epsilon_k\}_{1 \le k \le r}$ is a random sign sequence, and $\{u_k\}_{1 \le k \le r}$ and $\{v_k\}_{1 \le k \le r}$ have maximum entries of size
at most $\sqrt{\mu_B/n}$.

Lemma 2.1 There exist numerical constants $c$ and $C$ such that for any $\beta > 0$, matrices from the
incoherent basis model obey the assumption A1 with $\mu_1 \le C \mu_B \sqrt{(\beta + 2) \log n}$ with probability at
least $1 - c n^{-\beta}$.
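The effect of the random signs can be seen on a small experiment. The sketch below is only an illustration under assumptions of our own: the basis vectors are columns of a Sylvester-type Hadamard matrix (so every entry has magnitude exactly $1/\sqrt{n}$ and $\mu_B = 1$), we take $u_k = v_k$ as in the bad example of Section 2.1, and the sizes and $\beta$ are arbitrary.

```python
import numpy as np

n, r, beta = 512, 256, 3.0  # illustrative sizes; n must be a power of two here

# Sylvester construction of a Hadamard matrix by repeated Kronecker products.
H = np.array([[1.0]])
while H.shape[0] < n:
    H = np.kron(H, np.array([[1.0, 1.0], [1.0, -1.0]]))

# Orthonormal columns whose entries all have magnitude 1/sqrt(n), so mu_B = 1.
U = H[:, :r] / np.sqrt(n)
V = U.copy()  # symmetric case u_k = v_k
mu_B = 1.0

# Without signs, sum_k u_k v_k^* has diagonal entries equal to mu_B * r / n.
worst_max = np.abs(U @ V.T).max()

# With random signs, sum_k eps_k u_k v_k^* has much smaller entries.
rng = np.random.default_rng(0)
eps = rng.choice([-1.0, 1.0], size=r)
signed_max = np.abs((U * eps) @ V.T).max()

# Threshold appearing inside the probability in (2.2).
threshold = mu_B * np.sqrt(8 * beta * r * np.log(n)) / n
```

Comparing `signed_max` with `worst_max` and `threshold` shows the gap between the adversarial sign pattern and a typical one.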
2.2 Random subspaces span incoherent subspaces

In this section, we prove that the random orthogonal model obeys the two assumptions A0 and
A1 (with appropriate values for the $\mu$'s) with large probability.

Lemma 2.2 Set $\bar{r} = \max(r, \log n)$. Then there exist constants $C$ and $c$ such that the random
orthogonal model obeys (see footnote 3)

1. $\max_i \|P_U e_i\|^2 \le C \bar{r}/n$,

2. $\big\|\sum_{1 \le k \le r} u_k v_k^*\big\|_\infty \le C \log n\, \sqrt{\bar{r}}/n$,

with probability $1 - c n^{-3} \log n$.

We note that an argument similar to the following proof would give that if $C$ is of the form $K\beta$, where
$K$ is a fixed numerical constant, we can achieve a probability at least $1 - c n^{-\beta}$ provided that $n$ is
sufficiently large. To establish these facts, we make use of the standard result below [21].

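Lemma 2.3 Let $Y_d$ be distributed as a chi-squared random variable with $d$ degrees of freedom. Then
for each $t > 0$,

$$\mathbb{P}\big(Y_d - d \ge t\sqrt{2d} + t^2\big) \le e^{-t^2/2} \quad\text{and}\quad \mathbb{P}\big(Y_d - d \le -t\sqrt{2d}\big) \le e^{-t^2/2}. \tag{2.5}$$

We will use (2.5) as follows: for each $\epsilon \in (0, 1)$ we have

$$\mathbb{P}\big(Y_d \ge d(1 - \epsilon)^{-1}\big) \le e^{-\epsilon^2 d/4} \quad\text{and}\quad \mathbb{P}\big(Y_d \le d(1 - \epsilon)\big) \le e^{-\epsilon^2 d/4}. \tag{2.6}$$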

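We begin with the second assertion of Lemma 2.2 since it will imply the first as well. Observe
that it follows from

$$\|P_U e_i\|^2 = \sum_{1 \le k \le r} u_{ik}^2 \tag{2.7}$$

that $Z_r \equiv \|P_U e_i\|^2$ ($i$ is fixed) is the squared Euclidean length of the first $r$ components of a unit
vector uniformly distributed on the unit sphere in $n$ dimensions. Now suppose that $x_1, x_2, \ldots, x_n$
are i.i.d. $N(0, 1)$. Then the distribution of a unit vector uniformly distributed on the sphere is
that of $x/\|x\|$ and, therefore, the law of $Z_r$ is that of $Y_r/Y_n$, where $Y_r = \sum_{k \le r} x_k^2$. Fix $\epsilon > 0$ and
consider the event $A_{n,\epsilon} = \{Y_n/n \ge 1 - \epsilon\}$. For each $\lambda > 0$, it follows from (2.6) that

$$\begin{aligned}
\mathbb{P}\big(Z_r - r/n \ge \lambda\sqrt{2r}/n\big) &= \mathbb{P}\big(Y_r \ge [r + \lambda\sqrt{2r}]\, Y_n/n\big) \\
&\le \mathbb{P}\big(Y_r \ge [r + \lambda\sqrt{2r}]\, Y_n/n \text{ and } A_{n,\epsilon}\big) + \mathbb{P}\big(A_{n,\epsilon}^c\big) \\
&\le \mathbb{P}\big(Y_r \ge [r + \lambda\sqrt{2r}][1 - \epsilon]\big) + e^{-\epsilon^2 n/4} \\
&= \mathbb{P}\big(Y_r - r \ge \lambda\sqrt{2r}\,[1 - \epsilon - \epsilon\sqrt{r/2\lambda^2}]\big) + e^{-\epsilon^2 n/4}.
\end{aligned}$$

Now pick $\epsilon = 4(n^{-1}\log n)^{1/2}$, $\lambda = 8\sqrt{2\log n}$ and assume that $n$ is sufficiently large so that

$$\epsilon\big(1 + \sqrt{r/2\lambda^2}\big) \le 1/2.$$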



Footnote 3: When $r \ge C'(\log n)^3$ for some positive constant $C'$, a better estimate is possible, namely, $\big\|\sum_{1 \le k \le r} u_k v_k^*\big\|_\infty \le C\sqrt{r\log n}/n$.
Then

$$\mathbb{P}\big(Z_r - r/n \ge \lambda\sqrt{2r}/n\big) \le \mathbb{P}\big(Y_r - r \ge (\lambda/2)\sqrt{2r}\big) + n^{-4}.$$

Assume now that $r \ge 4\log n$ (which means that $\lambda \le 4\sqrt{2r}$). Then it follows from (2.5) that

$$\mathbb{P}\big(Y_r - r \ge (\lambda/2)\sqrt{2r}\big) \le \mathbb{P}\big(Y_r - r \ge (\lambda/4)\sqrt{2r} + (\lambda/4)^2\big) \le e^{-\lambda^2/32} = n^{-4}.$$

Hence

$$\mathbb{P}\big(Z_r - r/n \ge 16\sqrt{r\log n}/n\big) \le 2n^{-4}$$

and, therefore,

$$\mathbb{P}\big(\max_i \|P_U e_i\|^2 - r/n \ge 16\sqrt{r\log n}/n\big) \le 2n^{-3} \tag{2.8}$$

by the union bound. Note that (2.8) establishes the first claim of the lemma (even for $r < 4\log n$
since in this case $Z_r \le Z_{\lceil 4\log n\rceil}$).
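The bound (2.8) is easy to probe numerically. The sketch below draws one subspace from the random orthogonal model by orthonormalizing a Gaussian matrix (a standard way to obtain a uniformly distributed subspace) and compares the largest value of $\|P_U e_i\|^2$ with the right-hand side appearing inside (2.8). Sizes and seed are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, r = 400, 40  # illustrative sizes with r >= 4 log n (about 24 here)

# Orthonormal basis of a random r-dimensional subspace via QR of a Gaussian matrix.
Q, _ = np.linalg.qr(rng.standard_normal((n, r)))

# ||P_U e_i||^2 is the squared Euclidean norm of the i-th row of Q.
leverage = np.sum(Q**2, axis=1)
max_leverage = leverage.max()

# Deviation level used in (2.8): r/n + 16 sqrt(r log n) / n.
bound = r / n + 16 * np.sqrt(r * np.log(n)) / n

# Coherence mu(U) = (n / r) * max_i ||P_U e_i||^2, always at least 1.
mu_U = n / r * max_leverage
```

In runs of this kind the observed maximum typically sits far below `bound`, which reflects the conservative constants in the argument.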

It remains to establish the second claim. Notice that by symmetry, $E = \sum_{1 \le k \le r} u_k v_k^*$ has the
same distribution as

$$F = \sum_{k=1}^{r} \epsilon_k u_k v_k^*,$$

where $\{\epsilon_k\}$ is an independent Rademacher sequence. It then follows from Hoeffding's inequality
that conditional on $\{u_k\}$ and $\{v_k\}$ we have

$$\mathbb{P}(|F_{ij}| > t) \le 2 e^{-t^2/2\sigma_{ij}^2}, \qquad \sigma_{ij}^2 = \sum_{1 \le k \le r} u_{ik}^2 v_{jk}^2.$$

Our previous results indicate that $\max_{ij} |v_{ij}|^2 \le (10\log n)/n$ with large probability and thus

$$\sigma_{ij}^2 \le 10\, \frac{\log n}{n}\, \|P_U e_i\|^2.$$

Set $\bar{r} = \max(r, \log n)$. Since $\|P_U e_i\|^2 \le C\bar{r}/n$ with large probability, we have

$$\sigma_{ij}^2 \le C(\log n)\,\bar{r}/n^2$$

with large probability. Hence the marginal distribution of $F_{ij}$ obeys

$$\mathbb{P}\big(|F_{ij}| > \lambda\sqrt{\bar{r}}/n\big) \le 2 e^{-\gamma\lambda^2/\log n} + \mathbb{P}\big(\sigma_{ij}^2 \ge C(\log n)\,\bar{r}/n^2\big)$$

for some numerical constant $\gamma$. Picking $\lambda = \gamma' \log n$ where $\gamma'$ is a sufficiently large numerical
constant gives

$$\|F\|_\infty \le C(\log n)\sqrt{\bar{r}}/n$$

with large probability. Since $E$ and $F$ have the same distribution, the second claim follows.

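The claim about the size of $\max_{ij} |v_{ij}|^2$ is straightforward since our techniques show that for
each $\lambda > 0$,

$$\mathbb{P}\big(Z_1 \ge \lambda(\log n)/n\big) \le \mathbb{P}\big(Y_1 \ge \lambda(1 - \epsilon)\log n\big) + e^{-\epsilon^2 n/4}.$$

Moreover,

$$\mathbb{P}\big(Y_1 \ge \lambda(1 - \epsilon)\log n\big) = \mathbb{P}\big(|x_1| \ge \sqrt{\lambda(1 - \epsilon)\log n}\big) \le 2 e^{-\frac{1}{2}\lambda(1 - \epsilon)\log n}.$$

If $n$ is sufficiently large so that $\epsilon \le 1/5$, this gives $\mathbb{P}(Z_1 \ge 10(\log n)/n) \le 3n^{-4}$ and, therefore,

$$\mathbb{P}\big(\max_{ij} |v_{ij}|^2 \ge 10(\log n)/n\big) \le 12 n^{-3}\log n$$

since the maximum is taken over at most $4n\log n$ pairs.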
3 Duality

Let $\mathcal{R}_\Omega : \mathbb{R}^{n_1 \times n_2} \to \mathbb{R}^{|\Omega|}$ be the sampling operator which extracts the observed entries, $\mathcal{R}_\Omega(X) =
(X_{ij})_{ij \in \Omega}$, so that the constraint in (1.5) becomes $\mathcal{R}_\Omega(X) = \mathcal{R}_\Omega(M)$. Standard convex optimization
theory asserts that $X$ is solution to (1.5) if there exists a dual vector (or Lagrange multiplier)
$\lambda \in \mathbb{R}^{|\Omega|}$ such that $\mathcal{R}_\Omega^* \lambda$ is a subgradient of the nuclear norm at the point $X$, which we denote by

$$\mathcal{R}_\Omega^* \lambda \in \partial\|X\|_* \tag{3.1}$$

(see, e.g. [7]). Recall the definition of a subgradient of a convex function $f : \mathbb{R}^{n_1 \times n_2} \to \mathbb{R}$. We say
that $Y$ is a subgradient of $f$ at $X_0$, denoted $Y \in \partial f(X_0)$, if

$$f(X) \ge f(X_0) + \langle Y, X - X_0\rangle \tag{3.2}$$

for all $X$.

Suppose $X_0 \in \mathbb{R}^{n_1 \times n_2}$ has rank $r$ with a singular value decomposition given by

$$X_0 = \sum_{1 \le k \le r} \sigma_k u_k v_k^*. \tag{3.3}$$

With these notations, $Y$ is a subgradient of the nuclear norm at $X_0$ if and only if it is of the form

$$Y = \sum_{1 \le k \le r} u_k v_k^* + W, \tag{3.4}$$

where $W$ obeys the following two properties:

(i) the column space of $W$ is orthogonal to $U \equiv \operatorname{span}(u_1, \ldots, u_r)$, and the row space of $W$ is
orthogonal to $V \equiv \operatorname{span}(v_1, \ldots, v_r)$;

(ii) the spectral norm of $W$ is less than or equal to 1.

(see, e.g., [23,36]). To express these properties concisely, it is convenient to introduce the orthogonal
decomposition $\mathbb{R}^{n_1 \times n_2} = T \oplus T^\perp$ where $T$ is the linear space spanned by elements of the form $u_k x^*$
and $y v_k^*$, $1 \le k \le r$, where $x$ and $y$ are arbitrary, and $T^\perp$ is its orthogonal complement. Note that
$\dim(T) = r(n_1 + n_2 - r)$, precisely the number of degrees of freedom in the set of $n_1 \times n_2$ matrices
of rank $r$. $T^\perp$ is the subspace of matrices spanned by the family $(x y^*)$, where $x$ (respectively $y$) is
any vector orthogonal to $U$ (respectively $V$).

The orthogonal projection $\mathcal{P}_T$ onto $T$ is given by

$$\mathcal{P}_T(X) = P_U X + X P_V - P_U X P_V, \tag{3.5}$$

where $P_U$ and $P_V$ are the orthogonal projections onto $U$ and $V$. Note here that while $P_U$ and $P_V$
are matrices, $\mathcal{P}_T$ is a linear operator mapping matrices to matrices. We also have

$$\mathcal{P}_{T^\perp}(X) = (\mathcal{I} - \mathcal{P}_T)(X) = (I_{n_1} - P_U) X (I_{n_2} - P_V)$$

where $I_d$ denotes the $d \times d$ identity matrix. With these notations, $Y \in \partial\|X_0\|_*$ if

(i') $\mathcal{P}_T(Y) = \sum_{1 \le k \le r} u_k v_k^*$,

(ii') and $\|\mathcal{P}_{T^\perp} Y\| \le 1$.
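The projections (3.5) and the characterization (i')-(ii') translate directly into a few lines of NumPy. The following sketch builds one subgradient $Y = \sum_k u_k v_k^* + W$ for a random low-rank $X_0$ and spot-checks the subgradient inequality (3.2) at random points. The function names `P_T` and `P_Tperp`, the sizes and the seed are our own illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n1, n2, r = 7, 6, 2  # illustrative sizes

X0 = rng.standard_normal((n1, r)) @ rng.standard_normal((r, n2))
U, s, Vt = np.linalg.svd(X0, full_matrices=False)
U, Vt = U[:, :r], Vt[:r, :]
P_U, P_V = U @ U.T, Vt.T @ Vt
E = U @ Vt  # sum_k u_k v_k^*

def P_T(X):
    """Orthogonal projection onto T, equation (3.5)."""
    return P_U @ X + X @ P_V - P_U @ X @ P_V

def P_Tperp(X):
    """Orthogonal projection onto the complement of T."""
    return (np.eye(n1) - P_U) @ X @ (np.eye(n2) - P_V)

# Build a subgradient Y = E + W with W in T-perp and spectral norm 1.
W = P_Tperp(rng.standard_normal((n1, n2)))
W /= np.linalg.norm(W, 2)
Y = E + W

# Spot-check f(X) >= f(X0) + <Y, X - X0> for the nuclear norm f.
gaps = []
for _ in range(100):
    X = rng.standard_normal((n1, n2))
    lhs = np.linalg.norm(X, "nuc")
    rhs = np.linalg.norm(X0, "nuc") + np.sum(Y * (X - X0))
    gaps.append(lhs - rhs)
min_gap = min(gaps)
```

A nonnegative `min_gap` is consistent with $Y \in \partial\|X_0\|_*$; it is of course only a spot check, not a proof.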

Now that we have characterized the subgradient of the nuclear norm, the lemma below gives 
sufficient conditions for the uniqueness of the minimizer to (1.5). 

Lemma 3.1 Consider a matrix $X_0 = \sum_{k=1}^{r} \sigma_k u_k v_k^*$ of rank $r$ which is feasible for the problem
(1.5), and suppose that the following two conditions hold:

1. there exists a dual point $\lambda$ such that $Y = \mathcal{R}_\Omega^* \lambda$ obeys

$$\mathcal{P}_T(Y) = \sum_{k=1}^{r} u_k v_k^*, \qquad \|\mathcal{P}_{T^\perp}(Y)\| < 1; \tag{3.6}$$

2. the sampling operator $\mathcal{R}_\Omega$ restricted to elements in $T$ is injective.

Then $X_0$ is the unique minimizer.

Before proving this result, we would like to emphasize that this lemma provides a clear strategy
for proving our main result, namely, Theorem 1.3. Letting $M = \sum_{k=1}^{r} \sigma_k u_k v_k^*$, $M$ is the unique
solution to (1.5) if the injectivity condition holds and if one can find a dual point $\lambda$ such that
$Y = \mathcal{R}_\Omega^* \lambda$ obeys (3.6).

The proof of Lemma 3.1 uses a standard fact which states that the nuclear norm and the spectral 
norm are dual to one another. 

Lemma 3.2 For each pair $W$ and $H$, we have

$$\langle W, H\rangle \le \|W\|\, \|H\|_*.$$

In addition, for each $H$, there is a $W$ obeying $\|W\| = 1$ which achieves the equality.
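Both parts of Lemma 3.2 can be illustrated numerically. In the sketch below, the inequality is spot-checked on random pairs, and the equality case uses $W = U V^*$ built from the singular vectors of $H$, which is one standard choice of maximizer. Sizes and seed are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
H = rng.standard_normal((6, 5))
nuc_H = np.linalg.norm(H, "nuc")

# Inequality: <W, H> <= ||W|| ||H||_* for arbitrary W.
slacks = []
for _ in range(200):
    W = rng.standard_normal((6, 5))
    slacks.append(np.linalg.norm(W, 2) * nuc_H - np.sum(W * H))
min_slack = min(slacks)

# Equality: W_star = U V^T from the SVD of H has spectral norm 1
# and satisfies <W_star, H> = ||H||_*.
U, s, Vt = np.linalg.svd(H, full_matrices=False)
W_star = U @ Vt
attained = np.sum(W_star * H)
```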

A variety of proofs are available for this Lemma, and an elementary argument is sketched in [27]. 



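We now turn to the proof of Lemma 3.1.

Proof [of Lemma 3.1] Consider any perturbation $X_0 + H$ where $\mathcal{R}_\Omega(H) = 0$. Then for any $W^0$
obeying (i)-(ii), $\sum_{k=1}^{r} u_k v_k^* + W^0$ is a subgradient of the nuclear norm at $X_0$ and, therefore,

$$\|X_0 + H\|_* \ge \|X_0\|_* + \Big\langle \sum_{k=1}^{r} u_k v_k^* + W^0, H\Big\rangle.$$

Letting $W = \mathcal{P}_{T^\perp}(Y)$, we may write $\sum_{k=1}^{r} u_k v_k^* = \mathcal{R}_\Omega^* \lambda - W$. Since $\|W\| < 1$ and $\mathcal{R}_\Omega(H) = 0$
(so that $\langle \mathcal{R}_\Omega^* \lambda, H\rangle = \langle \lambda, \mathcal{R}_\Omega(H)\rangle = 0$), it then follows that

$$\|X_0 + H\|_* \ge \|X_0\|_* + \langle W^0 - W, H\rangle.$$

Now by construction

$$\langle W^0 - W, H\rangle = \langle \mathcal{P}_{T^\perp}(W^0 - W), H\rangle = \langle W^0 - W, \mathcal{P}_{T^\perp}(H)\rangle.$$

We use Lemma 3.2 and set $W^0 = \mathcal{P}_{T^\perp}(Z)$ where $Z$ is any matrix obeying $\|Z\| \le 1$ and
$\langle Z, \mathcal{P}_{T^\perp}(H)\rangle = \|\mathcal{P}_{T^\perp}(H)\|_*$. Then $W^0 \in T^\perp$, $\|W^0\| \le 1$, and

$$\langle W^0 - W, H\rangle \ge (1 - \|W\|)\, \|\mathcal{P}_{T^\perp}(H)\|_*,$$

which by assumption is strictly positive unless $\mathcal{P}_{T^\perp}(H) = 0$. In other words, $\|X_0 + H\|_* > \|X_0\|_*$
unless $\mathcal{P}_{T^\perp}(H) = 0$. Assume then that $\mathcal{P}_{T^\perp}(H) = 0$ or equivalently that $H \in T$. Then $\mathcal{R}_\Omega(H) = 0$
implies that $H = 0$ by the injectivity assumption. In conclusion, $\|X_0 + H\|_* > \|X_0\|_*$ unless $H = 0$.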
4 Architecture of the proof

Our strategy to prove that $M = \sum_{1 \le k \le r} \sigma_k u_k v_k^*$ is the unique minimizer to (1.5) is to construct a
matrix $Y$ which vanishes on $\Omega^c$ and obeys the conditions of Lemma 3.1 (and show the injectivity
of the sampling operator restricted to matrices in $T$ along the way). Set $\mathcal{P}_\Omega$ to be the orthogonal
projector onto the indices in $\Omega$ so that the $(i, j)$th component of $\mathcal{P}_\Omega(X)$ is equal to $X_{ij}$ if $(i, j) \in \Omega$
and zero otherwise. Our candidate $Y$ will be the solution to

$$\begin{array}{ll} \text{minimize} & \|X\|_F \\ \text{subject to} & (\mathcal{P}_T \mathcal{P}_\Omega)(X) = \sum_{k=1}^{r} u_k v_k^*. \end{array} \tag{4.1}$$

The matrix $Y$ vanishes on $\Omega^c$ as otherwise it would not be an optimal solution since $\mathcal{P}_\Omega(Y)$
would obey the constraint and have a smaller Frobenius norm. Hence $Y = \mathcal{P}_\Omega(Y)$ and $\mathcal{P}_T(Y) =
\sum_{k=1}^{r} u_k v_k^*$. Since the Pythagoras formula gives

$$\|Y\|_F^2 = \|\mathcal{P}_T(Y)\|_F^2 + \|\mathcal{P}_{T^\perp}(Y)\|_F^2 = \Big\|\sum_{k=1}^{r} u_k v_k^*\Big\|_F^2 + \|\mathcal{P}_{T^\perp}(Y)\|_F^2 = r + \|\mathcal{P}_{T^\perp}(Y)\|_F^2,$$

minimizing the Frobenius norm of $X$ amounts to minimizing the Frobenius norm of $\mathcal{P}_{T^\perp}(X)$ under
the constraint $\mathcal{P}_T(X) = \sum_{k=1}^{r} u_k v_k^*$. Our motivation is twofold. First, the solution to the least-squares
problem (4.1) has a closed form that is amenable to analysis. Second, by forcing $\mathcal{P}_{T^\perp}(Y)$
to be small in the Frobenius norm, we hope that it will be small in the spectral norm as well, and
establishing that $\|\mathcal{P}_{T^\perp}(Y)\| < 1$ would prove that $M$ is the unique solution to (1.5).
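To compute the solution to (4.1), we introduce the operator $\mathcal{A}_{\Omega T}$ defined by

$$\mathcal{A}_{\Omega T}(M) = \mathcal{P}_\Omega \mathcal{P}_T(M).$$

Then, if $\mathcal{A}_{\Omega T}^* \mathcal{A}_{\Omega T} = \mathcal{P}_T \mathcal{P}_\Omega \mathcal{P}_T$ has full rank when restricted to $T$, the minimizer to (4.1) is given by

$$Y = \mathcal{A}_{\Omega T}(\mathcal{A}_{\Omega T}^* \mathcal{A}_{\Omega T})^{-1}(E), \qquad E \equiv \sum_{k=1}^{r} u_k v_k^*. \tag{4.2}$$

We clarify the meaning of (4.2) to avoid any confusion. $(\mathcal{A}_{\Omega T}^* \mathcal{A}_{\Omega T})^{-1}(E)$ is meant to be that
element $F$ in $T$ obeying $(\mathcal{A}_{\Omega T}^* \mathcal{A}_{\Omega T})(F) = E$.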
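For small problems, the candidate $Y$ of (4.2) can be computed explicitly by writing the operators as matrices acting on vectorized inputs. The sketch below is one way to do this, under assumptions of our own: tiny illustrative sizes, a Bernoulli sampling mask, row-major vectorization, and a minimum-norm least-squares solve (`np.linalg.lstsq` with an explicit `rcond`) standing in for the inverse restricted to $T$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, r, p = 20, 2, 0.8  # small illustrative sizes and sampling rate (assumptions)

# Random rank-r matrix and the projections onto its singular subspaces.
M = rng.standard_normal((n, r)) @ rng.standard_normal((r, n))
U, _, Vt = np.linalg.svd(M)
U, V = U[:, :r], Vt[:r, :].T
P_U, P_V = U @ U.T, V @ V.T
E = U @ V.T

# Sampling mask: entry (i, j) is observed with probability p.
mask = rng.random((n, n)) < p

# Matrix representations on row-major vectorizations vec(X) = X.ravel(),
# using vec(A X B) = (A kron B^T) vec(X); P_V is symmetric.
I = np.eye(n)
PT = np.kron(P_U, I) + np.kron(I, P_V) - np.kron(P_U, P_V)
PO = np.diag(mask.ravel().astype(float))

# Solve (P_T P_Omega P_T) F = E. The minimum-norm solution lies in the range
# of the operator, which is T whenever the operator is injective on T.
A = PT @ PO @ PT
f, _, rank, _ = np.linalg.lstsq(A, E.ravel(), rcond=1e-10)
F = f.reshape(n, n)

# Y = P_Omega P_T (F) = P_Omega(F), since F already lies in T.
Y = mask * F

PT_Y = (PT @ Y.ravel()).reshape(n, n)
spec_norm_perp = np.linalg.norm(Y - PT_Y, 2)  # ||P_{T-perp}(Y)||
```

Inspecting `spec_norm_perp` for different sampling rates `p` gives a feel for when the certificate condition $\|\mathcal{P}_{T^\perp}(Y)\| < 1$ holds; the dense Kronecker representation is only practical for very small $n$.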
To summarize the aims of our proof strategy,

• We must first show that $\mathcal{A}_{\Omega T}^* \mathcal{A}_{\Omega T} = \mathcal{P}_T \mathcal{P}_\Omega \mathcal{P}_T$ is a one-to-one linear mapping from $T$ onto
itself. In this case, $\mathcal{A}_{\Omega T} = \mathcal{P}_\Omega \mathcal{P}_T$, as a mapping from $T$ to $\mathbb{R}^{n_1 \times n_2}$, is injective. This is
the second sufficient condition of Lemma 3.1. Moreover, our ansatz for $Y$ given by (4.2) is
well-defined.

• Having established that $Y$ is well-defined, we will show that

$$\|\mathcal{P}_{T^\perp}(Y)\| < 1,$$

thus proving the first sufficient condition.
4.1 The Bernoulli model

Instead of showing that the theorem holds when $\Omega$ is a set of size $m$ sampled uniformly at random,
we prove the theorem for a subset $\Omega'$ sampled according to the Bernoulli model. Here and below,
$\{\delta_{ij}\}_{1 \le i \le n_1, 1 \le j \le n_2}$ is a sequence of independent identically distributed 0/1 Bernoulli random
variables with

$$\mathbb{P}(\delta_{ij} = 1) = p \equiv \frac{m}{n_1 n_2}, \tag{4.3}$$

and define

$$\Omega' = \{(i, j) : \delta_{ij} = 1\}. \tag{4.4}$$

Note that $\mathbb{E}|\Omega'| = m$, so that the average cardinality of $\Omega'$ is that of $\Omega$. Following the same
reasoning as the argument developed in Section II.C of [11], one can show that the probability of 'failure'
under the uniform model is bounded by 2 times the probability of failure under the Bernoulli model;
the failure event is the event on which the solution to (1.5) is not exact. Hence, we can restrict our
attention to the Bernoulli model and from now on, we will assume that $\Omega$ is given by (4.4). This is
advantageous because the Bernoulli model admits a simpler analysis than uniform sampling thanks
to the independence between the $\delta_{ij}$'s.

4.2 The injectivity property 

We study the injectivity of $\mathcal{A}_{\Omega T}$, which also shows that $Y$ is well-defined. To prove this, we will
show that the linear operator $p^{-1}\mathcal{P}_T(\mathcal{P}_\Omega - p\mathcal{I})\mathcal{P}_T$ has small operator norm, which we recall is
$\sup_{\|X\|_F \le 1} p^{-1}\|\mathcal{P}_T(\mathcal{P}_\Omega - p\mathcal{I})\mathcal{P}_T(X)\|_F$.

Theorem 4.1 Suppose $\Omega$ is sampled according to the Bernoulli model (4.3)-(4.4) and put $n \equiv
\max(n_1, n_2)$. Suppose that the coherences obey $\max(\mu(U), \mu(V)) \le \mu_0$. Then, there is a numerical
constant $C_R$ such that for all $\beta > 1$,

$$p^{-1}\|\mathcal{P}_T \mathcal{P}_\Omega \mathcal{P}_T - p\mathcal{P}_T\| \le C_R \sqrt{\frac{\mu_0\, n r (\beta\log n)}{m}} \tag{4.5}$$

with probability at least $1 - 3n^{-\beta}$ provided that $C_R \sqrt{\mu_0\, n r (\beta\log n)/m} < 1$.
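Proof Decompose any matrix $X$ as $X = \sum_{ab} \langle X, e_a e_b^*\rangle e_a e_b^*$ so that

$$\mathcal{P}_T(X) = \sum_{ab} \langle \mathcal{P}_T(X), e_a e_b^*\rangle e_a e_b^* = \sum_{ab} \langle X, \mathcal{P}_T(e_a e_b^*)\rangle e_a e_b^*.$$

Hence, $\mathcal{P}_\Omega \mathcal{P}_T(X) = \sum_{ab} \delta_{ab} \langle X, \mathcal{P}_T(e_a e_b^*)\rangle e_a e_b^*$ which gives

$$(\mathcal{P}_T \mathcal{P}_\Omega \mathcal{P}_T)(X) = \sum_{ab} \delta_{ab} \langle X, \mathcal{P}_T(e_a e_b^*)\rangle \mathcal{P}_T(e_a e_b^*).$$

In other words,

$$\mathcal{P}_T \mathcal{P}_\Omega \mathcal{P}_T = \sum_{ab} \delta_{ab}\, \mathcal{P}_T(e_a e_b^*) \otimes \mathcal{P}_T(e_a e_b^*).$$

It follows from the definition (3.5) of $\mathcal{P}_T$ that

$$\mathcal{P}_T(e_a e_b^*) = (P_U e_a) e_b^* + e_a (P_V e_b)^* - (P_U e_a)(P_V e_b)^*. \tag{4.6}$$

This gives

$$\|\mathcal{P}_T(e_a e_b^*)\|_F^2 = \langle \mathcal{P}_T(e_a e_b^*), e_a e_b^*\rangle = \|P_U e_a\|^2 + \|P_V e_b\|^2 - \|P_U e_a\|^2 \|P_V e_b\|^2 \tag{4.7}$$

and since $\|P_U e_a\|^2 \le \mu(U) r/n_1$ and $\|P_V e_b\|^2 \le \mu(V) r/n_2$,

$$\|\mathcal{P}_T(e_a e_b^*)\|_F^2 \le 2\mu_0 r/\min(n_1, n_2). \tag{4.8}$$

Now the fact that the operator $\mathcal{P}_T \mathcal{P}_\Omega \mathcal{P}_T$ does not deviate from its expected value

$$\mathbb{E}(\mathcal{P}_T \mathcal{P}_\Omega \mathcal{P}_T) = \mathcal{P}_T(\mathbb{E}\mathcal{P}_\Omega)\mathcal{P}_T = \mathcal{P}_T(p\mathcal{I})\mathcal{P}_T = p\mathcal{P}_T$$

in the spectral norm is related to Rudelson's selection theorem [29]. The first part of the theorem
below may be found in [10] for example, see also [30] for a very similar statement.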
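The identities used in this part of the proof are deterministic and can be verified directly. The sketch below checks (4.7) and (4.8) for every pair $(a, b)$ and compares the rank-one expansion of $\mathcal{P}_T \mathcal{P}_\Omega \mathcal{P}_T$ with a direct evaluation. The subspaces, the mask and the test matrix are random illustrative choices, and `mu0` is simply taken as the larger of the two coherences.

```python
import numpy as np

rng = np.random.default_rng(0)
n1, n2, r = 9, 8, 2  # illustrative sizes

U, _ = np.linalg.qr(rng.standard_normal((n1, r)))
V, _ = np.linalg.qr(rng.standard_normal((n2, r)))
P_U, P_V = U @ U.T, V @ V.T

def P_T(X):
    """Orthogonal projection onto T, equation (3.5)."""
    return P_U @ X + X @ P_V - P_U @ X @ P_V

lev_U = np.diag(P_U)  # ||P_U e_a||^2 for each a
lev_V = np.diag(P_V)  # ||P_V e_b||^2 for each b
mu0 = max(n1 / r * lev_U.max(), n2 / r * lev_V.max())

mask = rng.random((n1, n2)) < 0.5  # Bernoulli sampling pattern delta_ab
X = rng.standard_normal((n1, n2))

max_identity_err = 0.0           # largest violation of (4.7)
max_sq_norm = 0.0                # largest value of ||P_T(e_a e_b^*)||_F^2
tensor_sum = np.zeros((n1, n2))  # sum_ab delta_ab <X, P_T(e_a e_b^*)> P_T(e_a e_b^*)
for a in range(n1):
    for b in range(n2):
        basis = np.zeros((n1, n2))
        basis[a, b] = 1.0
        G = P_T(basis)
        sq = np.sum(G**2)
        rhs = lev_U[a] + lev_V[b] - lev_U[a] * lev_V[b]
        max_identity_err = max(max_identity_err, abs(sq - rhs))
        max_sq_norm = max(max_sq_norm, sq)
        if mask[a, b]:
            tensor_sum += np.sum(X * G) * G

direct = P_T(mask * P_T(X))  # (P_T P_Omega P_T)(X) evaluated directly
```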

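Theorem 4.2 [10] Let $\{\delta_{ab}\}$ be independent 0/1 Bernoulli variables with $\mathbb{P}(\delta_{ab} = 1) = p = \frac{m}{n_1 n_2}$
and put $n = \max(n_1, n_2)$. Suppose that $\|\mathcal{P}_T(e_a e_b^*)\|_F^2 \le 2\mu_0 r/n$. Set

$$Z \equiv p^{-1}\Big\|\sum_{ab} (\delta_{ab} - p)\, \mathcal{P}_T(e_a e_b^*) \otimes \mathcal{P}_T(e_a e_b^*)\Big\| = p^{-1}\|\mathcal{P}_T \mathcal{P}_\Omega \mathcal{P}_T - p\mathcal{P}_T\|.$$

1. There exists a constant $C_R'$ such that

$$\mathbb{E} Z \le C_R' \sqrt{\frac{\mu_0\, n r \log n}{m}} \tag{4.9}$$

provided that the right-hand side is smaller than 1.

2. Suppose $\mathbb{E} Z \le 1$. Then for each $\lambda > 0$, we have

$$\mathbb{P}\left(|Z - \mathbb{E} Z| > \lambda\sqrt{\frac{\mu_0\, n r \log n}{m}}\right) \le 3\exp\left(-\gamma_0' \min\left\{\lambda^2 \log n,\ \lambda\sqrt{\frac{m\log n}{\mu_0\, n r}}\right\}\right) \tag{4.10}$$

for some positive constant $\gamma_0'$.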



As mentioned above, the first part, namely, (4.9) is an application of an established result which
states that if $\{y_i\}$ is a family of vectors in $\mathbb{R}^d$ and $\{\delta_i\}$ is a 0/1 Bernoulli sequence with $\mathbb{P}(\delta_i = 1) = p$,
then

$$p^{-1}\, \mathbb{E}\Big\|\sum_i (\delta_i - p)\, y_i \otimes y_i\Big\| \le C\sqrt{\frac{\log d}{p}}\, \max_i \|y_i\|$$

for some $C > 0$ provided that the right-hand side is less than 1. The proof may be found in the
cited literature, e.g. in [10]. Hence, the first part follows from applying this result to vectors of
the form $\mathcal{P}_T(e_a e_b^*)$ and using the available bound on $\|\mathcal{P}_T(e_a e_b^*)\|_F$. The second part follows from
Talagrand's concentration inequality and may be found in the Appendix.

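Set $\lambda = \sqrt{\beta/\gamma_0'}$ and assume that $m > (\beta/\gamma_0')\,\mu_0\, n r \log n$. Then the left-hand side of (4.10) is
bounded by $3n^{-\beta}$ and thus, we established that

$$Z \le C_R' \sqrt{\frac{\mu_0\, n r \log n}{m}} + \frac{1}{\sqrt{\gamma_0'}}\sqrt{\frac{\mu_0\, n r\, \beta\log n}{m}}$$

with probability at least $1 - 3n^{-\beta}$. Setting $C_R = C_R' + 1/\sqrt{\gamma_0'}$ finishes the proof.

Take $m$ large enough so that $C_R \sqrt{\mu_0 (nr/m)\log n} \le 1/2$. Then it follows from (4.5) that

$$\frac{p}{2}\|\mathcal{P}_T(X)\|_F \le \|(\mathcal{P}_T \mathcal{P}_\Omega \mathcal{P}_T)(X)\|_F \le \frac{3p}{2}\|\mathcal{P}_T(X)\|_F \tag{4.11}$$

for all $X$ with large probability. In particular, the operator $\mathcal{A}_{\Omega T}^* \mathcal{A}_{\Omega T} = \mathcal{P}_T \mathcal{P}_\Omega \mathcal{P}_T$ mapping $T$ onto
itself is well-conditioned and hence invertible. An immediate consequence is the following:

Corollary 4.3 Assume that $C_R \sqrt{\mu_0\, n r (\log n)/m} \le 1/2$. With the same probability as in Theorem
4.1, we have

$$\|\mathcal{P}_\Omega \mathcal{P}_T(X)\|_F \le \sqrt{3p/2}\, \|\mathcal{P}_T(X)\|_F. \tag{4.12}$$

Proof We have $\|\mathcal{P}_\Omega \mathcal{P}_T(X)\|_F^2 = \langle X, (\mathcal{P}_\Omega \mathcal{P}_T)^*(\mathcal{P}_\Omega \mathcal{P}_T)X\rangle = \langle X, (\mathcal{P}_T \mathcal{P}_\Omega \mathcal{P}_T)X\rangle$ and thus

$$\|\mathcal{P}_\Omega \mathcal{P}_T(X)\|_F^2 = \langle \mathcal{P}_T X, (\mathcal{P}_T \mathcal{P}_\Omega \mathcal{P}_T)X\rangle \le \|\mathcal{P}_T(X)\|_F\, \|(\mathcal{P}_T \mathcal{P}_\Omega \mathcal{P}_T)(X)\|_F,$$

where the inequality is due to Cauchy-Schwarz. The conclusion (4.12) follows from (4.11).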

4.3 The size property 

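In this section, we explain how we will show that $\|\mathcal{P}_{T^\perp}(Y)\| < 1$. This result will follow from five
lemmas that we will prove in Section 6. Introduce

$$\mathcal{H} \equiv \mathcal{P}_T - p^{-1}\mathcal{P}_T \mathcal{P}_\Omega \mathcal{P}_T,$$

which obeys $\|\mathcal{H}(X)\|_F \le C_R \sqrt{\mu_0 (nr/m)\beta\log n}\, \|\mathcal{P}_T(X)\|_F$ with large probability because of Theorem 4.1.
For any matrix $X \in T$, $(\mathcal{P}_T \mathcal{P}_\Omega \mathcal{P}_T)^{-1}(X)$ can be expressed in terms of the power series

$$(\mathcal{P}_T \mathcal{P}_\Omega \mathcal{P}_T)^{-1}(X) = p^{-1}\big(X + \mathcal{H}(X) + \mathcal{H}^2(X) + \ldots\big)$$

for $\mathcal{H}$ is a contraction when $m$ is sufficiently large. Since $Y = \mathcal{P}_\Omega \mathcal{P}_T(\mathcal{P}_T \mathcal{P}_\Omega \mathcal{P}_T)^{-1}\big(\sum_{1 \le k \le r} u_k v_k^*\big)$,
$\mathcal{P}_{T^\perp}(Y)$ may be decomposed as

$$\mathcal{P}_{T^\perp}(Y) = p^{-1}(\mathcal{P}_{T^\perp} \mathcal{P}_\Omega \mathcal{P}_T)\big(E + \mathcal{H}(E) + \mathcal{H}^2(E) + \ldots\big), \qquad E = \sum_{1 \le k \le r} u_k v_k^*. \tag{4.13}$$

To bound the norm of the left-hand side, it is of course sufficient to bound the norm of the summands
in the right-hand side. Taking the following five lemmas together establishes Theorem 1.3.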

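Lemma 4.4 Fix $\beta \ge 2$ and $\lambda \ge 1$. There is a numerical constant $C_0$ such that if $m \ge \lambda\mu_1^2\, n r \beta\log n$,
then

$$p^{-1}\|(\mathcal{P}_{T^\perp} \mathcal{P}_\Omega \mathcal{P}_T)E\| \le C_0 \lambda^{-1/2} \tag{4.14}$$

with probability at least $1 - n^{-\beta}$.

Lemma 4.5 Fix $\beta \ge 2$ and $\lambda \ge 1$. There are numerical constants $C_1$ and $c_1$ such that if $m \ge
\lambda\mu_1 \max(\sqrt{\mu_0}, \mu_1)\, n r \beta\log n$, then

$$p^{-1}\|(\mathcal{P}_{T^\perp} \mathcal{P}_\Omega \mathcal{P}_T)\mathcal{H}(E)\| \le C_1 \lambda^{-1} \tag{4.15}$$

with probability at least $1 - c_1 n^{-\beta}$.

Lemma 4.6 Fix $\beta \ge 2$ and $\lambda \ge 1$. There are numerical constants $C_2$ and $c_2$ such that if $m \ge
\lambda\mu_0^{4/3}\, n r^{4/3} \beta\log n$, then

$$p^{-1}\|(\mathcal{P}_{T^\perp} \mathcal{P}_\Omega \mathcal{P}_T)\mathcal{H}^2(E)\| \le C_2 \lambda^{-3/2} \tag{4.16}$$

with probability at least $1 - c_2 n^{-\beta}$.

Lemma 4.7 Fix $\beta \ge 2$ and $\lambda \ge 1$. There are numerical constants $C_3$ and $c_3$ such that if $m \ge
\lambda\mu_0^2\, n r^2 \beta\log n$, then

$$p^{-1}\|(\mathcal{P}_{T^\perp} \mathcal{P}_\Omega \mathcal{P}_T)\mathcal{H}^3(E)\| \le C_3 \lambda^{-1/2} \tag{4.17}$$

with probability at least $1 - c_3 n^{-\beta}$.

Lemma 4.8 Under the assumptions of Theorem 4.1, there is a numerical constant $C_{k_0}$ such that
if $m \ge (2C_R)^2 \mu_0\, n r \beta\log n$, then

$$p^{-1}\Big\|(\mathcal{P}_{T^\perp} \mathcal{P}_\Omega \mathcal{P}_T)\sum_{k \ge k_0} \mathcal{H}^k(E)\Big\| \le C_{k_0} \Big(\frac{n^2 r}{m}\Big)^{1/2} \Big(\frac{\mu_0\, n r \beta\log n}{m}\Big)^{k_0/2} \tag{4.18}$$

with probability at least $1 - n^{-\beta}$.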

Let us now show how we may combine these lemmas to prove our main results. Under all of the
assumptions of Theorem 1.3, consider the four Lemmas 4.4, 4.5, 4.6 and 4.8, the latter applied with
$k_0 = 3$. Together they imply that there are numerical constants $c$ and $C$ such that $\|\mathcal{P}_{T^\perp}(Y)\| < 1$
with probability at least $1 - c n^{-\beta}$ provided that the number of samples obeys

$$m \ge C \max\big(\mu_1^2,\ \mu_0^{1/2}\mu_1,\ \mu_0^{4/3} r^{1/3},\ \mu_0 n^{1/4}\big)\, n r \beta\log n \tag{4.19}$$

for some constant $C$. The four expressions in the maximum come from Lemmas 4.4, 4.5, 4.6 and
4.8 in this order. Now the bound (4.19) is only interesting in the range when $\mu_0 n^{1/4} r$ is smaller
than a constant times $n$ as otherwise the right-hand side is greater than $n^2$ (this would say that one
would see all the entries in which case our claim is trivial). When $\mu_0 r \le n^{3/4}$, $(\mu_0 r)^{4/3} \le \mu_0 n^{5/4} r$
and thus the recovery is exact provided that $m$ obeys (1.9).

For the case concerning small values of the rank, we consider all five lemmas, with Lemma
4.8 applied with $k_0 = 4$. Together they imply that $\|\mathcal{P}_{T^\perp}(Y)\| < 1$ with probability at
least $1 - c n^{-\beta}$ provided that the number of samples obeys

$$m \ge C \max\big(\mu_0^2 r,\ \mu_0 n^{1/5}\big)\, n r \beta\log n \tag{4.20}$$

for some constant $C$. The two expressions in the maximum come from Lemmas 4.7 and 4.8 in this
order. The reason for this simplified formulation is that the terms $\mu_1^2$, $\mu_0^{1/2}\mu_1$ and $\mu_0^{4/3} r^{1/3}$ which
come from Lemmas 4.4, 4.5 and 4.6 are bounded above by $\mu_0^2 r$ since $\mu_1 \le \mu_0\sqrt{r}$. When $\mu_0 r \le n^{1/5}$,
the recovery is exact provided that $m$ obeys (1.10).



5 Connections with Random Graph Theory 

5.1 The injectivity property and the coupon collector's problem 

We argued in the Introduction that to have any hope of recovering an unknown matrix of rank 1 
by any method whatsoever, one needs at least one observation per row and one observation per 
column. Sample m entries uniformly at random. Viewing the row indices as bins, assign the kth
sampled entry to the bin corresponding to its row index. Then to have any hope of recovering our 
matrix, all the bins need to be occupied. Quantifying how many samples are required to fill all of 
the bins is the famous coupon collector's problem. 

Coupon collection is also connected to the injectivity of the sampling operator $\mathcal{P}_\Omega$, restricted to
elements in $T$. Suppose we sample the entries of a rank 1 matrix equal to $x y^*$ with left and right
singular vectors $u = x/\|x\|$ and $v = y/\|y\|$ respectively and have not seen anything in the $i$th
row. Then we claim that $\mathcal{P}_\Omega$ (restricted to $T$) has a nontrivial null space and thus $\mathcal{P}_T \mathcal{P}_\Omega \mathcal{P}_T$ is not
invertible. Indeed, consider the matrix $e_i v^*$. This matrix is in $T$ and

$$\mathcal{P}_\Omega(e_i v^*) = 0$$

since $e_i v^*$ vanishes outside of the $i$th row. The same applies to the columns as well. If we have not
seen anything in column $j$, then the rank-1 matrix $u e_j^* \in T$ and $\mathcal{P}_\Omega(u e_j^*) = 0$. In conclusion, the
invertibility of $\mathcal{P}_T \mathcal{P}_\Omega \mathcal{P}_T$ implies a complete collection.

When the entries are sampled uniformly at random, it is well known that one needs on the
order of $n\log n$ samples to sample all the rows. What is interesting is that Theorem 4.1 implies
that $\mathcal{P}_T \mathcal{P}_\Omega \mathcal{P}_T$ is invertible (a stronger property) when the number of samples is also on the order
of $n\log n$. A particular implication of this discussion is that the logarithmic factors in Theorem 4.1
are unavoidable.
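The $n \log n$ scaling for hitting every row is the classical coupon collector estimate and is simple to simulate. The sketch below assumes sampling with replacement, for which the expected number of draws is $n H_n \approx n \log n$; the function name, the number of trials and the seed are our own illustrative choices.

```python
import numpy as np

def samples_until_all_rows_seen(n, rng):
    """Draw entries of an n x n matrix uniformly at random (with replacement)
    and count the draws until every row index has appeared at least once."""
    seen = np.zeros(n, dtype=bool)
    remaining, draws = n, 0
    while remaining > 0:
        i = rng.integers(n)  # only the row index of the sampled entry matters
        draws += 1
        if not seen[i]:
            seen[i] = True
            remaining -= 1
    return draws

rng = np.random.default_rng(0)
n, trials = 100, 200
counts = np.array([samples_until_all_rows_seen(n, rng) for _ in range(trials)])

mean_draws = counts.mean()
expected = n * np.sum(1.0 / np.arange(1, n + 1))  # n * H_n, roughly n log n
```

The same experiment applied to column indices, or to both at once, gives the analogous count for covering every column.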



5.2 The injectivity property and the connectivity problem 

To recover a matrix of rank 1, one needs much more than at least one observation per row and
column. Let $R$ be the set of row indices, $1 \le i \le n$, and $C$ be the set of column indices, $1 \le j \le n$,
and consider the bipartite graph connecting vertices $i \in R$ to vertices $j \in C$ if and only if $(i, j) \in \Omega$,
i.e. the $(i, j)$th entry is observed. We claim that if this graph is not fully connected, then one cannot
hope to recover a matrix of rank 1.

To see this, we let $I$ be the set of row indices and $J$ be the set of column indices in any
connected component. We will assume that $I$ and $J$ are nonempty as otherwise, one is in the
previously discussed situation where some rows or columns are not sampled. Consider a rank 1
matrix equal to $x y^*$ as before with singular vectors $u = x/\|x\|$ and $v = y/\|y\|$. Then all the
information about the values of the $x_i$'s with $i \in I$ and of the $y_j$'s with $j \in J$ are given by the
sampled entries connecting $I$ to $J$ since all the other observed entries connect vertices in $I^c$ to those
in $J^c$. Now even if one observes all the entries $x_i y_j$ with $i \in I$ and $j \in J$, then at least the signs of
$x_i$, $i \in I$, and of $y_j$, $j \in J$, would remain undetermined. Indeed, if the values $(x_i)_{i \in I}$, $(y_j)_{j \in J}$ are
consistent with the observed entries, so are the values $(-x_i)_{i \in I}$, $(-y_j)_{j \in J}$. However, since the same
analysis holds for the sets $I^c$ and $J^c$, there are at least two matrices consistent with the observed
entries and exact matrix completion is impossible.

The connectivity of the graph is also related to the injectivity of the sampling operator $\mathcal{P}_\Omega$,
restricted to elements in $T$. If the graph is not fully connected, then we claim that $\mathcal{P}_\Omega$ (restricted
to $T$) has a nontrivial null space and thus $\mathcal{P}_T \mathcal{P}_\Omega \mathcal{P}_T$ is not invertible. Indeed, consider the matrix

$$M = a v^* + u b^*,$$

where $a_i = -u_i$ if $i \in I$ and $a_i = u_i$ otherwise, and $b_j = v_j$ if $j \in J$ and $b_j = -v_j$ otherwise. Then
this matrix is in $T$ and obeys

$$M_{ij} = 0$$

if $(i, j) \in I \times J$ or $(i, j) \in I^c \times J^c$. Note that on the complement, i.e. $(i, j) \in I \times J^c$ or $(i, j) \in I^c \times J$,
one has $M_{ij} = \pm 2 u_i v_j$ (with the minus sign on $I \times J^c$ and the plus sign on $I^c \times J$) and one can show that $M \ne 0$ unless $u v^* = 0$. Since $\Omega$ is included in the
union of $I \times J$ and $I^c \times J^c$, we have that $\mathcal{P}_\Omega(M) = 0$. In conclusion, the invertibility of $\mathcal{P}_T \mathcal{P}_\Omega \mathcal{P}_T$
implies a fully connected graph.
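The construction above can be reproduced on a toy example. The sketch below uses a hypothetical $6 \times 6$ sampling pattern whose bipartite graph has exactly two components, builds $M = a v^* + u b^*$, and also exhibits two distinct rank-1 matrices that agree on every observed entry. The index sets, sizes and names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6
I_rows = np.array([0, 1, 2])  # rows of one connected component (hypothetical)
J_cols = np.array([0, 1])     # columns of the same component (hypothetical)

in_I = np.zeros(n, dtype=bool)
in_I[I_rows] = True
in_J = np.zeros(n, dtype=bool)
in_J[J_cols] = True

# Observed set: all of I x J together with all of I^c x J^c (a disconnected graph).
mask = np.outer(in_I, in_J) | np.outer(~in_I, ~in_J)

# Rank-1 matrix x y^* with unit singular vectors u and v.
x, y = rng.standard_normal(n), rng.standard_normal(n)
u, v = x / np.linalg.norm(x), y / np.linalg.norm(y)

# Null-space element M = a v^* + u b^* with the sign flips described in the text.
a = np.where(in_I, -u, u)
b = np.where(in_J, v, -v)
M_null = np.outer(a, v) + np.outer(u, b)

# Projection onto T in the rank-1 case.
P_U, P_V = np.outer(u, u), np.outer(v, v)

def P_T(X):
    return P_U @ X + X @ P_V - P_U @ X @ P_V

# A second rank-1 matrix consistent with the same observations:
# flip the signs of x on I and of y on J.
original = np.outer(x, y)
alternative = np.outer(np.where(in_I, -x, x), np.where(in_J, -y, y))
```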

When the entries are sampled uniformly at random, it is well known that one needs on the order
of $n\log n$ samples to obtain a fully connected graph with large probability (see, e.g., [8]). Remarkably,
Theorem 4.1 implies that $\mathcal{P}_T \mathcal{P}_\Omega \mathcal{P}_T$ is invertible (a stronger property) when the number of
samples is also on the order of $n\log n$.

6 Proofs of the Critical Lemmas 



In this section, we prove the five lemmas of Section 4.3. Before we begin, however, we develop a
simple estimate which we will use throughout. For each pair $(a, b)$ and $(a', b')$, it follows from the
expression of $\mathcal{P}_T(e_a e_b^*)$ (4.6) that

$$\langle \mathcal{P}_T(e_{a'} e_{b'}^*), e_a e_b^*\rangle = \langle e_a, P_U e_{a'}\rangle 1_{\{b = b'\}} + \langle e_b, P_V e_{b'}\rangle 1_{\{a = a'\}} - \langle e_a, P_U e_{a'}\rangle\langle e_b, P_V e_{b'}\rangle. \tag{6.1}$$

Fix $\mu_0$ obeying $\mu(U) \le \mu_0$ and $\mu(V) \le \mu_0$ and note that

$$|\langle e_a, P_U e_{a'}\rangle| = |\langle P_U e_a, P_U e_{a'}\rangle| \le \|P_U e_a\|\, \|P_U e_{a'}\| \le \mu_0 r/n_1$$

and similarly for $\langle e_b, P_V e_{b'}\rangle$. Suppose that $b = b'$ and $a \ne a'$, then

$$|\langle \mathcal{P}_T(e_{a'} e_{b'}^*), e_a e_b^*\rangle| = |\langle e_a, P_U e_{a'}\rangle|\,(1 - \|P_V e_b\|^2) \le \mu_0 r/n_1.$$

We have a similar bound when $a = a'$ and $b \ne b'$ whereas when $a \ne a'$ and $b \ne b'$,

$$|\langle \mathcal{P}_T(e_{a'} e_{b'}^*), e_a e_b^*\rangle| \le (\mu_0 r)^2/(n_1 n_2).$$

In short, it follows from this analysis (and from (4.8) for the case where $(a, b) = (a', b')$) that

$$\max_{ab, a'b'} |\langle \mathcal{P}_T(e_{a'} e_{b'}^*), e_a e_b^*\rangle| \le 2\mu_0 r/\min(n_1, n_2). \tag{6.2}$$

A consequence of (4.8) is the estimate:

$$\sum_{a'b'} |\langle \mathcal{P}_T(e_{a'} e_{b'}^*), e_a e_b^*\rangle|^2 = \sum_{a'b'} |\langle \mathcal{P}_T(e_a e_b^*), e_{a'} e_{b'}^*\rangle|^2 = \|\mathcal{P}_T(e_a e_b^*)\|_F^2 \le 2\mu_0 r/\min(n_1, n_2), \tag{6.3}$$

which we will apply several times. A related estimate is this:

$$\max_a \sum_b |E_{ab}|^2 \le \mu_0 r/\min(n_1, n_2), \tag{6.4}$$

and the same is true by exchanging the role of $a$ and $b$. To see this, write

$$\sum_b |E_{ab}|^2 = \|e_a^* E\|^2 = \Big\|\sum_{j \le r} v_j \langle u_j, e_a\rangle\Big\|^2 = \sum_{j \le r} |\langle u_j, e_a\rangle|^2 = \|P_U e_a\|^2,$$

and the conclusion follows from the coherence property.

We will prove the lemmas in the case where $n_1 = n_2 = n$ for simplicity, i.e. in the case of
square matrices of dimension $n$. The general case is treated in exactly the same way. In fact, the
argument only makes use of the bounds (6.2), (6.3) (and sometimes (6.4)), and the general case is
obtained by replacing $n$ with $\min(n_1, n_2)$.

Each of the following subsections computes the operator norm of some random variable. In
each section, we denote by $S$ the quantity whose norm we wish to analyze. We will also frequently
use the notation $H$ for some auxiliary matrix variable whose norm we will need to bound. Hence,
we will reuse the same notation many times rather than introducing dozens of new names, just like
in computer programming where one uses the same variable name in distinct routines.

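6.1 Proof of Lemma 4.4

In this section, we develop a bound on

$$p^{-1}\|\mathcal{P}_{T^\perp} \mathcal{P}_\Omega \mathcal{P}_T(E)\| = p^{-1}\|\mathcal{P}_{T^\perp}(\mathcal{P}_\Omega - p\mathcal{I})\mathcal{P}_T(E)\| \le p^{-1}\|(\mathcal{P}_\Omega - p\mathcal{I})(E)\|,$$

where the equality follows from $\mathcal{P}_{T^\perp}\mathcal{P}_T = 0$, and the inequality from $\mathcal{P}_T(E) = E$ together with
$\|\mathcal{P}_{T^\perp}(X)\| \le \|X\|$ which is valid for any matrix $X$. Set

$$S \equiv p^{-1}(\mathcal{P}_\Omega - p\mathcal{I})(E) = p^{-1}\sum_{ab} (\delta_{ab} - p) E_{ab}\, e_a e_b^*. \tag{6.5}$$

We think of $S$ as a random variable since it depends on the random $\delta_{ab}$'s, and note that $\mathbb{E} S = 0$.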



The proof of Lemma 4.4 operates by developing an estimate on the size of $(\mathbb{E}\|S\|^q)^{1/q}$ for some
$q \ge 1$ and by applying Markov's inequality to bound the tail of the random variable $\|S\|$. To do this,
we shall use a symmetrization argument and the noncommutative Khintchine inequality. Since the
function $f(S) = \|S\|^q$ is convex, Jensen's inequality gives that

$$\mathbb{E}\|S\|^q \le \mathbb{E}\|S - S'\|^q,$$

where $S' = p^{-1}\sum_{ab}(\delta'_{ab} - p)E_{ab}\,e_a e_b^*$ is an independent copy of $S$. Since $(\delta_{ab} - \delta'_{ab})$ is symmetric,
$S - S'$ has the same distribution as

$$p^{-1}\sum_{ab}\epsilon_{ab}(\delta_{ab} - \delta'_{ab})E_{ab}\,e_a e_b^* \equiv S_\epsilon - S'_\epsilon,$$

where $\{\epsilon_{ab}\}$ is an independent Rademacher sequence and $S_\epsilon = p^{-1}\sum_{ab}\epsilon_{ab}\delta_{ab}E_{ab}\,e_a e_b^*$. Further, the
triangle inequality gives

$$(\mathbb{E}\|S_\epsilon - S'_\epsilon\|^q)^{1/q} \le (\mathbb{E}\|S_\epsilon\|^q)^{1/q} + (\mathbb{E}\|S'_\epsilon\|^q)^{1/q} = 2(\mathbb{E}\|S_\epsilon\|^q)^{1/q}$$

since $S_\epsilon$ and $S'_\epsilon$ have the same distribution and, therefore,

$$(\mathbb{E}\|S\|^q)^{1/q} \le 2p^{-1}\Big(\mathbb{E}_\delta\mathbb{E}_\epsilon\Big\|\sum_{ab}\epsilon_{ab}\delta_{ab}E_{ab}\,e_a e_b^*\Big\|^q\Big)^{1/q}.$$



We are now in position to apply the noncommutative Khintchine inequality which bounds the
Schatten norm of a Rademacher series. For $q \ge 1$, the Schatten $q$-norm of a matrix is denoted by

$$\|X\|_{S_q} = \Big(\sum_{i=1}^n \sigma_i(X)^q\Big)^{1/q}.$$

Note that the nuclear norm is equal to the Schatten 1-norm and the Frobenius norm is equal to
the Schatten 2-norm. The following theorem was originally proven by Lust-Piquard [25], and was
later sharpened by Buchholz [9].

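Lemma 6.1 (Noncommutative Khintchine inequality) Let $(X_i)_{1 \le i \le r}$ be a finite sequence of
matrices of the same dimension and let $\{\epsilon_i\}$ be a Rademacher sequence. For each $q \ge 2$

$$\Big[\mathbb{E}_\epsilon\Big\|\sum_i \epsilon_i X_i\Big\|_{S_q}^q\Big]^{1/q} \le C_K\sqrt{q}\,\max\Big[\Big\|\Big(\sum_i X_i^* X_i\Big)^{1/2}\Big\|_{S_q},\ \Big\|\Big(\sum_i X_i X_i^*\Big)^{1/2}\Big\|_{S_q}\Big],$$

where $C_K = 2^{-1/4}\sqrt{\pi/e}$.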

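For reference, if $X$ is an $n \times n$ matrix and $q \ge \log n$, we have

$$\|X\| \le \|X\|_{S_q} \le e\,\|X\|,$$

so that the Schatten $q$-norm is within a multiplicative constant from the operator norm. Observe
now that with $q' \ge q$

$$\big(\mathbb{E}_\delta\mathbb{E}_\epsilon\|S_\epsilon\|^q\big)^{1/q} \le \big(\mathbb{E}_\delta\mathbb{E}_\epsilon\|S_\epsilon\|_{S_{q'}}^q\big)^{1/q} \le \big(\mathbb{E}_\delta\mathbb{E}_\epsilon\|S_\epsilon\|_{S_{q'}}^{q'}\big)^{1/q'}.$$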
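The comparison between the Schatten $q$-norm and the operator norm stated above can be checked numerically from singular values alone. The sketch below is illustrative only: the function names and the test singular values are hypothetical, and the worst case for the upper bound is a matrix whose $n$ singular values are all equal, where the ratio is $n^{1/q} \le e$ as soon as $q \ge \log n$.

```python
import math

def schatten_norm(singular_values, q):
    """Schatten q-norm computed from a list of singular values."""
    return sum(s ** q for s in singular_values) ** (1.0 / q)

def operator_norm(singular_values):
    """Operator (spectral) norm: the largest singular value."""
    return max(singular_values)

n = 50
q = math.ceil(math.log(n))          # smallest integer q with q >= log n

# Two hypothetical spectra: a decaying one and the worst case (all equal).
decaying = [1.0 / (k + 1) for k in range(n)]
flat = [1.0] * n

for sigma in (decaying, flat):
    op = operator_norm(sigma)
    sq = schatten_norm(sigma, q)
    assert op <= sq <= math.e * op
```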

We apply the noncommutative Khintchine inequality with $q' \ge \log n$, and after a little algebra,
obtain

$$\big(\mathbb{E}_\delta\mathbb{E}_\epsilon\|S_\epsilon\|_{S_{q'}}^{q'}\big)^{1/q'} \le C_K\,\frac{e\sqrt{q'}}{p}\,\Big(\mathbb{E}_\delta\max\Big[\Big\|\sum_{ab}\delta_{ab}E_{ab}^2\,e_a e_a^*\Big\|^{q'/2},\ \Big\|\sum_{ab}\delta_{ab}E_{ab}^2\,e_b e_b^*\Big\|^{q'/2}\Big]\Big)^{1/q'}.$$

The two terms in the right-hand side are essentially the same and if we can bound any one of them,
the same technique will apply to the other. We consider the first and since $\sum_{ab}\delta_{ab}E_{ab}^2\,e_a e_a^*$ is a
diagonal matrix,

$$\Big\|\sum_{ab}\delta_{ab}E_{ab}^2\,e_a e_a^*\Big\| = \max_a\sum_b \delta_{ab}E_{ab}^2.$$

The following lemma bounds the $q$th moment of this quantity.

Lemma 6.2 Suppose that $q$ is an integer obeying $1 \le q \le np$ and assume $np \ge 2\log n$. Then

$$\mathbb{E}_\delta\Big(\max_a\sum_b\delta_{ab}E_{ab}^2\Big)^q \le 2\,\big(2np\,\|E\|_\infty^2\big)^q. \tag{6.6}$$

The proof of this lemma is in the Appendix. The same estimate applies to $\mathbb{E}\big(\max_b\sum_a\delta_{ab}E_{ab}^2\big)^q$
and thus for each $q \ge 1$

$$\mathbb{E}_\delta\max\Big[\Big\|\sum_{ab}\delta_{ab}E_{ab}^2\,e_a e_a^*\Big\|^q,\ \Big\|\sum_{ab}\delta_{ab}E_{ab}^2\,e_b e_b^*\Big\|^q\Big] \le 4\,\big(2np\,\|E\|_\infty^2\big)^q.$$

(In the rectangular case, the same estimate holds with $n = \max(n_1, n_2)$.)

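Take $q = \beta\log n$ for some $\beta \ge 1$, and set $q' = q$. Then since $\|E\|_\infty \le \mu_1\sqrt{r}/n$, we established
that

$$(\mathbb{E}\|S\|^q)^{1/q} \le C\,\frac{1}{p}\sqrt{\beta\log n}\,\sqrt{np}\,\|E\|_\infty \le C\mu_1\sqrt{\frac{nr\beta\log n}{m}} \equiv R_0.$$

Then by Markov's inequality, for each $t > 0$,

$$\mathbb{P}(\|S\| > tR_0) \le t^{-q},$$

and for $t = e$, we conclude that

$$\mathbb{P}\Big(\|S\| > Ce\mu_1\sqrt{\frac{nr\beta\log n}{m}}\Big) \le n^{-\beta}$$

with the proviso that $m \ge \max(\beta, 2)\,n\log n$ so that Lemma 6.2 holds.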
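The last step is plain arithmetic: Markov's inequality applied to $\|S\|^q$ gives a tail of $t^{-q}$, and the choice $t = e$, $q = \beta\log n$ turns it into $n^{-\beta}$. A minimal numerical check of that substitution follows; the values of $n$ and $\beta$ are arbitrary and chosen only for illustration.

```python
import math

def markov_tail(t, q):
    """Bound P(||S|| > t * R0) <= t**(-q), valid when (E||S||^q)^(1/q) <= R0."""
    return t ** (-q)

n, beta = 1000, 3.0                 # hypothetical values
q = beta * math.log(n)              # moment order used in the proof

# With t = e the tail bound is e**(-beta * log n) = n**(-beta).
tail = markov_tail(math.e, q)
assert math.isclose(tail, n ** (-beta), rel_tol=1e-9)
```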



We have not made any assumption in this section about the matrix $E$ (except that we have a
bound on the maximum entry) and, therefore, have proved the theorem below, which shall be used
many times in the sequel.

Theorem 6.3 Let $X$ be a fixed $n \times n$ matrix. There is a constant $C_0$ such that for each $\beta > 2$

$$p^{-1}\|(\mathcal{P}_\Omega - p\mathcal{I})(X)\| \le C_0\Big(\frac{\beta n\log n}{p}\Big)^{1/2}\|X\|_\infty \tag{6.7}$$

with probability at least $1 - n^{-\beta}$ provided that $np \ge \beta\log n$.

Note that this is the same $C_0$ described in Lemma 4.4.
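To connect Theorem 6.3 with the bound obtained for $E$, one can substitute $p = m/n^2$ and $\|X\|_\infty = \mu_1\sqrt{r}/n$ into a right-hand side of the form $C_0(\beta n\log n/p)^{1/2}\|X\|_\infty$, which yields $C_0\mu_1\sqrt{nr\beta\log n/m}$. The sketch below checks this substitution numerically. The reading of (6.7) used here, the value $C_0 = 1$, and all parameter values are assumptions made for the example.

```python
import math

def theorem_6_3_rhs(n, p, beta, x_inf, c0=1.0):
    """C0 * sqrt(beta * n * log(n) / p) * ||X||_inf; c0 = 1.0 is a placeholder."""
    return c0 * math.sqrt(beta * n * math.log(n) / p) * x_inf

# Hypothetical problem sizes.
n, r, m, beta, mu1 = 200, 5, 20000, 3.0, 1.5
p = m / n ** 2                                  # sampling probability
e_inf = mu1 * math.sqrt(r) / n                  # bound on the entries of E

general_form = theorem_6_3_rhs(n, p, beta, e_inf)
specialized = mu1 * math.sqrt(n * r * beta * math.log(n) / m)
assert math.isclose(general_form, specialized)
```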



6.2 Proof of Lemma 4.5

We now need to bound the spectral norm of $\mathcal{P}_{T^\perp}\mathcal{P}_\Omega\mathcal{P}_T\mathcal{H}(E)$ and will use some of the ideas
developed in the previous section. Just as before,

$$p^{-1}\|\mathcal{P}_{T^\perp}\mathcal{P}_\Omega\mathcal{P}_T\mathcal{H}(E)\| \le p^{-1}\|(\mathcal{P}_\Omega - p\mathcal{I})\mathcal{H}(E)\|,$$

and put

$$S \equiv p^{-1}(\mathcal{P}_\Omega - p\mathcal{I})\mathcal{H}(E) = p^{-2}\sum_{ab,\,a'b'}\xi_{ab}\,\xi_{a'b'}\,E_{a'b'}\,\langle\mathcal{P}_T(e_{a'}e_{b'}^*), e_a e_b^*\rangle\,e_a e_b^*,$$

where here and below, $\xi_{ab} \equiv \delta_{ab} - p$. Decompose $S$ as

$$S = p^{-2}\sum_{(a,b) = (a',b')} + \; p^{-2}\sum_{(a,b) \ne (a',b')} \equiv S_0 + S_1. \tag{6.8}$$

We bound the spectral norm of the diagonal and off-diagonal contributions separately.
We begin with $S_0$ and decompose $(\xi_{ab})^2$ as

$$(\xi_{ab})^2 = (\delta_{ab} - p)^2 = (1 - 2p)(\delta_{ab} - p) + p(1 - p) = (1 - 2p)\xi_{ab} + p(1 - p),$$

which allows us to express $S_0$ as

$$S_0 = \frac{1 - 2p}{p}\sum_{ab}\xi_{ab}H_{ab}\,e_a e_b^* + (1 - p)\sum_{ab}H_{ab}\,e_a e_b^*, \qquad H_{ab} \equiv p^{-1}E_{ab}\,\langle\mathcal{P}_T(e_a e_b^*), e_a e_b^*\rangle. \tag{6.9}$$



Theorem 6.3 bounds the spectral norm of the first term of the right-hand side and we have

$$p^{-1}\Big\|\sum_{ab}\xi_{ab}H_{ab}\,e_a e_b^*\Big\| \le C_0\sqrt{\frac{n^3\beta\log n}{m}}\,\|H\|_\infty$$

with probability at least $1 - n^{-\beta}$. Now since $\|E\|_\infty \le \mu_1\sqrt{r}/n$ and $|\langle\mathcal{P}_T(e_a e_b^*), e_a e_b^*\rangle| \le 2\mu_0 r/n$ by
(6.2), $\|H\|_\infty \le \mu_0\mu_1(2r/np)\sqrt{r}/n$, and

$$p^{-1}\Big\|\sum_{ab}\xi_{ab}H_{ab}\,e_a e_b^*\Big\| \le C\mu_0\mu_1\,\frac{nr}{m}\sqrt{\frac{nr\beta\log n}{m}}$$

with the same probability. The second term of the right-hand side in (6.9) is deterministic and we
develop an argument that we will reuse several times. We record a useful lemma.
Lemma 6.4 Let $X$ be a fixed matrix and set $Z \equiv \sum_{ab}X_{ab}\,\langle\mathcal{P}_T(e_a e_b^*), e_a e_b^*\rangle\,e_a e_b^*$. Then

$$\|Z\| \le \frac{2\mu_0 r}{n}\,\|X\|.$$

Proof Let $\Lambda_U$ and $\Lambda_V$ be the diagonal matrices with entries $\|P_U e_a\|^2$ and $\|P_V e_b\|^2$ respectively,

$$\Lambda_U = \mathrm{diag}(\|P_U e_a\|^2), \qquad \Lambda_V = \mathrm{diag}(\|P_V e_b\|^2). \tag{6.10}$$

To bound the spectral norm of $Z$, observe that it follows from (4.7) that

$$Z = \Lambda_U X + X\Lambda_V - \Lambda_U X\Lambda_V = \Lambda_U X(I - \Lambda_V) + X\Lambda_V. \tag{6.11}$$

Hence, since $\|\Lambda_U\|$ and $\|\Lambda_V\|$ are bounded by $\min(\mu_0 r/n, 1)$ and $\|I - \Lambda_V\| \le 1$, we have

$$\|Z\| \le \|\Lambda_U\|\,\|X\|\,\|I - \Lambda_V\| + \|X\|\,\|\Lambda_V\| \le (2\mu_0 r/n)\,\|X\|. \qquad ■$$



Clearly, this lemma and $\|E\| = 1$ give that $H$ defined in (6.9) obeys $\|H\| \le 2\mu_0 r/np$. In summary,

$$\|S_0\| \le C\,\frac{nr}{m}\Big(\mu_0\mu_1\sqrt{\frac{nr\beta\log n}{m}} + \mu_0\Big)$$

for some $C > 0$ with the same probability as in Lemma 4.4.

It remains to bound the off-diagonal term. To this end, we use a useful decoupling lemma:

Lemma 6.5 [16] Let $\{\eta_i\}_{1 \le i \le n}$ be a sequence of independent random variables, and $\{x_{ij}\}_{i \ne j}$ be
elements taken from a Banach space. Then

$$\mathbb{P}\Big(\Big\|\sum_{i \ne j}\eta_i\eta_j x_{ij}\Big\| \ge t\Big) \le C_D\,\mathbb{P}\Big(\Big\|\sum_{i \ne j}\eta_i\eta'_j x_{ij}\Big\| > t/C_D\Big), \tag{6.12}$$

where $\{\eta'_i\}$ is an independent copy of $\{\eta_i\}$.

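This lemma asserts that it is sufficient to estimate $\mathbb{P}(\|S'_1\| \ge t)$ where $S'_1$ is given by

$$S'_1 \equiv p^{-2}\sum_{ab \ne a'b'}\xi_{ab}\,\xi'_{a'b'}\,E_{a'b'}\,\langle\mathcal{P}_T(e_{a'}e_{b'}^*), e_a e_b^*\rangle\,e_a e_b^* \tag{6.13}$$

in which $\{\xi'_{ab}\}$ is an independent copy of $\{\xi_{ab}\}$. We write $S'_1$ as

$$S'_1 = p^{-1}\sum_{ab}\xi_{ab}H_{ab}\,e_a e_b^*, \qquad H_{ab} \equiv p^{-1}\sum_{a'b':(a',b') \ne (a,b)}\xi'_{a'b'}\,E_{a'b'}\,\langle\mathcal{P}_T(e_{a'}e_{b'}^*), e_a e_b^*\rangle. \tag{6.14}$$

To bound the tail of $\|S'_1\|$, observe that

$$\mathbb{P}(\|S'_1\| \ge t) \le \mathbb{P}(\|S'_1\| \ge t \mid \|H\|_\infty \le K) + \mathbb{P}(\|H\|_\infty > K).$$

By independence, the first term of the right-hand side is bounded by Theorem 6.3. On the event
$\{\|H\|_\infty \le K\}$, we have

$$p^{-1}\Big\|\sum_{ab}\xi_{ab}H_{ab}\,e_a e_b^*\Big\| \le C\sqrt{\frac{n^3\beta\log n}{m}}\,K$$

with probability at least $1 - n^{-\beta}$. To bound $\|H\|_\infty$, we use Bernstein's inequality.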
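Lemma 6.6 Let $X$ be a fixed matrix and define $Q(X)$ as the matrix whose $(a, b)$th entry is

$$[Q(X)]_{ab} = p^{-1}\sum_{a'b':(a',b') \ne (a,b)}(\delta_{a'b'} - p)\,X_{a'b'}\,\langle\mathcal{P}_T(e_{a'}e_{b'}^*), e_a e_b^*\rangle,$$

where $\{\delta_{ab}\}$ is an independent Bernoulli sequence obeying $\mathbb{P}(\delta_{ab} = 1) = p$. Then

$$\mathbb{P}\Big(\|Q(X)\|_\infty > \lambda\sqrt{\frac{\mu_0 r}{np}}\,\|X\|_\infty\Big) \le 2n^2\exp\Big(-\frac{\lambda^2}{2 + \frac{2}{3}\sqrt{\frac{\mu_0 r}{np}}\,\lambda}\Big). \tag{6.15}$$

With $\lambda = \sqrt{3\beta\log n}$, the right-hand side is bounded by $2n^{2-\beta}$ provided that $np \ge \frac{4\beta}{3}\mu_0 r\log n$. In
particular, for $\lambda = \sqrt{6\beta\log n}$ with $\beta > 2$, the bound is less than $2n^{-\beta}$ provided that $np \ge \frac{8\beta}{3}\mu_0 r\log n$.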



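Proof The inequality (6.15) is an application of Bernstein's inequality, which states that for a
sum of uniformly bounded independent zero-mean random variables obeying $|Y_k| \le c$,

$$\mathbb{P}\Big(\Big|\sum_k Y_k\Big| > t\Big) \le 2\exp\Big(-\frac{t^2}{2\sigma^2 + 2ct/3}\Big), \tag{6.16}$$

where $\sigma^2$ is the sum of the variances, $\sigma^2 \equiv \sum_k \mathrm{Var}(Y_k)$. We have

$$\mathrm{Var}([Q(X)]_{ab}) = \frac{1-p}{p}\sum_{a'b':(a',b') \ne (a,b)}|X_{a'b'}|^2\,|\langle\mathcal{P}_T(e_{a'}e_{b'}^*), e_a e_b^*\rangle|^2 \le \frac{1-p}{p}\,\|X\|_\infty^2\sum_{a'b'}|\langle\mathcal{P}_T(e_a e_b^*), e_{a'}e_{b'}^*\rangle|^2 \le \frac{1-p}{p}\,\|X\|_\infty^2\,2\mu_0 r/n$$

by (6.3). Also

$$p^{-1}\,|(\delta_{a'b'} - p)\,X_{a'b'}\,\langle\mathcal{P}_T(e_{a'}e_{b'}^*), e_a e_b^*\rangle| \le p^{-1}\,\|X\|_\infty\,2\mu_0 r/n$$

and hence, for each $t > 0$, (6.16) gives

$$\mathbb{P}(|[Q(X)]_{ab}| > t) \le 2\exp\Big(-\frac{t^2}{\frac{2\mu_0 r}{np}\|X\|_\infty^2 + \frac{2\mu_0 r}{3np}\|X\|_\infty\,t}\Big). \tag{6.17}$$

Putting $t = \lambda\sqrt{\mu_0 r/np}\,\|X\|_\infty$ for some $\lambda > 0$ and applying the union bound gives (6.15). ■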
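The numerical claims attached to Lemma 6.6 can be checked directly. The sketch below implements Bernstein's bound in the standard form $2\exp(-t^2/(2\sigma^2 + 2ct/3))$ and, as an assumption about how (6.15) reads, a right-hand side of the form $2n^2\exp\big(-\lambda^2/(2 + \tfrac{2}{3}\lambda\sqrt{\mu_0 r/np})\big)$. With $\lambda = \sqrt{3\beta\log n}$ and $np = \tfrac{4\beta}{3}\mu_0 r\log n$ this evaluates to exactly $2n^{2-\beta}$. The function names and parameter values are hypothetical.

```python
import math

def bernstein_tail(t, sigma2, c):
    """Bernstein bound 2*exp(-t^2 / (2*sigma2 + 2*c*t/3)) for |Y_k| <= c."""
    return 2.0 * math.exp(-t * t / (2.0 * sigma2 + 2.0 * c * t / 3.0))

def lemma_6_6_rhs(n, p, mu0, r, lam):
    """Assumed form of the right-hand side of (6.15)."""
    ratio = math.sqrt(mu0 * r / (n * p))
    return 2.0 * n ** 2 * math.exp(-lam ** 2 / (2.0 + (2.0 / 3.0) * ratio * lam))

# Hypothetical parameters; p is set at the boundary np = (4*beta/3)*mu0*r*log n.
n, r, mu0, beta = 500, 4, 1.2, 3.0
p = (4.0 * beta / 3.0) * mu0 * r * math.log(n) / n
lam = math.sqrt(3.0 * beta * math.log(n))

bound = lemma_6_6_rhs(n, p, mu0, r, lam)
assert 0.0 < p < 1.0
assert math.isclose(bound, 2.0 * n ** (2.0 - beta))
```

At the boundary value of $np$ the product $\lambda\sqrt{\mu_0 r/np}$ equals $3/2$, so the denominator in the exponent is exactly 3.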



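Since $\|E\|_\infty \le \mu_1\sqrt{r}/n$ it follows that $H = Q(E)$ introduced in (6.14) obeys

$$\|H\|_\infty \le C\,\frac{\mu_1\sqrt{r}}{n}\sqrt{\frac{\mu_0 nr\beta\log n}{m}}$$

with probability at least $1 - 2n^{-\beta}$ for each $\beta > 2$ and, therefore,

$$\|S'_1\| \le C\sqrt{\mu_0}\,\mu_1\,\frac{nr\beta\log n}{m}$$

with probability at least $1 - 3n^{-\beta}$. In conclusion, we have

$$p^{-1}\|(\mathcal{P}_\Omega - p\mathcal{I})\mathcal{H}(E)\| \le C\,\frac{nr}{m}\Big(\mu_0\mu_1\sqrt{\frac{nr\beta\log n}{m}} + \mu_0\Big) + C\sqrt{\mu_0}\,\mu_1\,\frac{nr\beta\log n}{m} \tag{6.18}$$

with probability at least $1 - (1 + 3C_D)n^{-\beta}$. A simple algebraic manipulation concludes the proof
of Lemma 4.5. Note that we have not made any assumption about the matrix $E$ and, therefore,
established the following: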



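Lemma 6.7 Let $X$ be a fixed $n \times n$ matrix. There is a constant $C'_0$ such that

$$p^{-2}\Big\|\sum_{(a,b) \ne (a',b')}\xi_{ab}\,\xi_{a'b'}\,X_{a'b'}\,\langle\mathcal{P}_T(e_{a'}e_{b'}^*), e_a e_b^*\rangle\,e_a e_b^*\Big\| \le C'_0\,\frac{\sqrt{\mu_0 r}\,\beta\log n}{p}\,\|X\|_\infty \tag{6.19}$$

with probability at least $1 - O(n^{-\beta})$ for all $\beta > 2$ provided that $np \ge 3\mu_0 r\beta\log n$.

6.3 Proof of Lemma 4.6

To prove Lemma 4.6, we need to bound the spectral norm of $p^{-1}(\mathcal{P}_\Omega - p\mathcal{I})\mathcal{H}^2(E)$, a matrix given
by

$$p^{-3}\sum_{a_1b_1,\,a_2b_2,\,a_3b_3}\xi_{a_1b_1}\xi_{a_2b_2}\xi_{a_3b_3}\,E_{a_3b_3}\,\langle\mathcal{P}_T(e_{a_3}e_{b_3}^*), e_{a_2}e_{b_2}^*\rangle\,\langle\mathcal{P}_T(e_{a_2}e_{b_2}^*), e_{a_1}e_{b_1}^*\rangle\,e_{a_1}e_{b_1}^*,$$

where $\xi_{ab} = \delta_{ab} - p$ as before. It is convenient to introduce notations to compress this expression.
Set $\omega = (a, b)$ (and $\omega_i = (a_i, b_i)$ for $i = 1, 2, 3$), $F_\omega = e_a e_b^*$, and $P_{\omega'\omega} = \langle\mathcal{P}_T(e_{a'}e_{b'}^*), e_a e_b^*\rangle$ so that

$$p^{-1}(\mathcal{P}_\Omega - p\mathcal{I})\mathcal{H}^2(E) = p^{-3}\sum_{\omega_1,\omega_2,\omega_3}\xi_{\omega_1}\xi_{\omega_2}\xi_{\omega_3}\,E_{\omega_3}P_{\omega_3\omega_2}P_{\omega_2\omega_1}F_{\omega_1}.$$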

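Partition the sum depending on whether some of the $\omega_i$'s are the same or not

$$\frac{1}{p}(\mathcal{P}_\Omega - p\mathcal{I})\mathcal{H}^2(E) = \frac{1}{p^3}\Big[\sum_{\omega_1 = \omega_2 = \omega_3} + \sum_{\omega_1 \ne \omega_2 = \omega_3} + \sum_{\omega_1 = \omega_3 \ne \omega_2} + \sum_{\omega_1 = \omega_2 \ne \omega_3} + \sum_{\omega_1 \ne \omega_2 \ne \omega_3}\Big]. \tag{6.20}$$

The meaning should be clear; for instance, the sum $\sum_{\omega_1 \ne \omega_2 = \omega_3}$ is the sum over the $\omega$'s such that
$\omega_2 = \omega_3$ and $\omega_1 \ne \omega_2$. Similarly, $\sum_{\omega_1 \ne \omega_2 \ne \omega_3}$ is the sum over the $\omega$'s such that they are all distinct.
The idea is now to use a decoupling argument to bound each sum in the right-hand side of (6.20)
(except for the first which does not need to be decoupled) and show that all terms are appropriately
small in the spectral norm.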

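We begin with the first term which is equal to

$$\frac{1}{p^3}\sum_\omega(\xi_\omega)^3E_\omega P_{\omega\omega}^2F_\omega = \frac{1 - 3p + 3p^2}{p^3}\sum_\omega\xi_\omega E_\omega P_{\omega\omega}^2F_\omega + \frac{1 - 3p + 2p^2}{p^2}\sum_\omega E_\omega P_{\omega\omega}^2F_\omega, \tag{6.21}$$

where we have used the identity

$$(\xi_\omega)^3 = (1 - 3p + 3p^2)\,\xi_\omega + p(1 - 3p + 2p^2).$$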
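Both the cubic identity above and the quadratic identity of Section 6.2 are instances of one reduction: since $\delta \in \{0, 1\}$, every power of $\xi = \delta - p$ is an affine function of $\xi$. The sketch below is an illustration, not part of the proof; the function name is hypothetical. It computes coefficients $(a_k, b_k)$ with $\xi^k = a_k\xi + b_k$ by the recursion $a_{k+1} = (1 - 2p)a_k + b_k$, $b_{k+1} = p(1 - p)a_k$, using exact rational arithmetic.

```python
from fractions import Fraction

def centered_power_coeffs(p, k):
    """Return (a_k, b_k) such that xi**k == a_k*xi + b_k for xi = delta - p,
    where delta takes values in {0, 1}.  Uses xi**2 = (1-2p)*xi + p*(1-p)."""
    a, b = Fraction(1), Fraction(0)          # xi**1 = 1*xi + 0
    for _ in range(k - 1):
        # xi * (a*xi + b) = a*xi**2 + b*xi, then reduce xi**2.
        a, b = a * (1 - 2 * p) + b, a * p * (1 - p)
    return a, b

p = Fraction(1, 7)                            # arbitrary rational probability
a2, b2 = centered_power_coeffs(p, 2)
a3, b3 = centered_power_coeffs(p, 3)

assert (a2, b2) == (1 - 2 * p, p * (1 - p))
assert (a3, b3) == (1 - 3 * p + 3 * p ** 2, p * (1 - 3 * p + 2 * p ** 2))

# Direct check on both possible values of delta.
for delta in (0, 1):
    xi = delta - p
    assert xi ** 3 == a3 * xi + b3
```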
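Set $H_\omega = E_\omega(p^{-1}P_{\omega\omega})^2$. For the first term in the right-hand side of (6.21), we need to control
$\|\sum_\omega\xi_\omega H_\omega F_\omega\|$. This is easily bounded by Theorem 6.3. Indeed, it follows from

$$|H_\omega| \le \Big(\frac{2\mu_0 r}{np}\Big)^2\|E\|_\infty$$

that for each $\beta > 0$,

$$p^{-1}\Big\|\sum_\omega\xi_\omega H_\omega F_\omega\Big\| \le C\Big(\frac{\mu_0 nr}{m}\Big)^2\mu_1\sqrt{\frac{nr\beta\log n}{m}} = C\mu_0^2\mu_1\sqrt{\beta\log n}\,\Big(\frac{nr}{m}\Big)^{5/2}$$

with probability at least $1 - n^{-\beta}$. For the second term in the right-hand side of (6.21), we apply
Lemma 6.4 which gives

$$\Big\|\sum_\omega E_\omega P_{\omega\omega}^2F_\omega\Big\| \le (2\mu_0 r/n)^2$$

so that $\|H\| \le (2\mu_0 r/np)^2$. In conclusion, the first term in (6.20) has a spectral norm which is
bounded by

$$C\Big(\frac{nr}{m}\Big)^2\Big(\mu_0^2\mu_1\Big(\frac{nr\beta\log n}{m}\Big)^{1/2} + \mu_0^2\Big)$$

with probability at least $1 - n^{-\beta}$.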

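We now turn our attention to the second term which can be written as

$$p^{-3}\sum_{\omega_1 \ne \omega_2}\xi_{\omega_1}(\xi_{\omega_2})^2E_{\omega_2}P_{\omega_2\omega_2}P_{\omega_2\omega_1}F_{\omega_1} = \frac{1 - 2p}{p^3}\sum_{\omega_1 \ne \omega_2}\xi_{\omega_1}\xi_{\omega_2}E_{\omega_2}P_{\omega_2\omega_2}P_{\omega_2\omega_1}F_{\omega_1} + \frac{1 - p}{p^2}\sum_{\omega_1 \ne \omega_2}\xi_{\omega_1}E_{\omega_2}P_{\omega_2\omega_2}P_{\omega_2\omega_1}F_{\omega_1}.$$

Put $S_1$ for the first term; bounding $\|S_1\|$ is a simple application of Lemma 6.7 with $X_\omega = p^{-1}E_\omega P_{\omega\omega}$,
which gives

$$\|S_1\| \le C\mu_0^{3/2}\mu_1(\beta\log n)\Big(\frac{nr}{m}\Big)^2$$

since $\|E\|_\infty \le \mu_1\sqrt{r}/n$. For the second term, we need to bound the spectral norm of $S_2$ where

$$S_2 \equiv p^{-1}\sum_{\omega_1}\xi_{\omega_1}H_{\omega_1}F_{\omega_1}, \qquad H_{\omega_1} = p^{-1}\sum_{\omega_2:\omega_2 \ne \omega_1}E_{\omega_2}P_{\omega_2\omega_2}P_{\omega_2\omega_1}.$$

Note that $H$ is deterministic. The lemma below provides an estimate about $\|H\|_\infty$.

Lemma 6.8 The matrix $H$ obeys

$$\|H\|_\infty \le \frac{\mu_0 r}{np}\Big(3\|E\|_\infty + 2\,\frac{\mu_0 r}{n}\Big). \tag{6.22}$$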

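Proof We begin by rewriting $H$ as

$$pH_\omega = \sum_{\omega'}E_{\omega'}P_{\omega'\omega'}P_{\omega'\omega} - E_\omega P_{\omega\omega}^2.$$

Clearly, $|E_\omega P_{\omega\omega}^2| \le (\mu_0 r/n)^2\|E\|_\infty$ so that it suffices to bound the first term, which is the $\omega$th entry
of the matrix

$$\sum_{\omega,\omega'}E_{\omega'}P_{\omega'\omega'}P_{\omega'\omega}F_\omega = \mathcal{P}_T(\Lambda_UE + E\Lambda_V - \Lambda_UE\Lambda_V).$$

Now it is immediate to see that $\Lambda_UE \in T$ and likewise for $E\Lambda_V$. Hence,

$$\|\mathcal{P}_T(\Lambda_UE + E\Lambda_V - \Lambda_UE\Lambda_V)\|_\infty \le \|\Lambda_UE\|_\infty + \|E\Lambda_V\|_\infty + \|\mathcal{P}_T(\Lambda_UE\Lambda_V)\|_\infty \le 2\|E\|_\infty\,\mu_0 r/n + \|\mathcal{P}_T(\Lambda_UE\Lambda_V)\|_\infty.$$

We finally use the crude estimate

$$\|\mathcal{P}_T(\Lambda_UE\Lambda_V)\|_\infty \le \|\mathcal{P}_T(\Lambda_UE\Lambda_V)\| \le 2\|\Lambda_UE\Lambda_V\| \le 2(\mu_0 r/n)^2$$

to complete the proof of the lemma. ■

As a consequence of this lemma, Theorem 6.3 gives

$$\|S_2\| \le C\sqrt{\beta\log n}\,\Big(\frac{nr}{m}\Big)^{3/2}\big(\mu_0\mu_1 + \mu_0^2\sqrt{r}\big)$$

with probability at least $1 - n^{-\beta}$. In conclusion, the second term in (6.20) has spectral norm
bounded by

$$C\sqrt{\beta\log n}\,\Big(\frac{nr}{m}\Big)^{3/2}\Big(\mu_0\mu_1\sqrt{\frac{\mu_0 nr\beta\log n}{m}} + \mu_0\mu_1 + \mu_0^2\sqrt{r}\Big)$$

with probability at least $1 - O(n^{-\beta})$.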

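We now examine the third term which can be written as

$$\frac{1 - 2p}{p^3}\sum_{\omega_1 \ne \omega_2}\xi_{\omega_1}\xi_{\omega_2}E_{\omega_1}P_{\omega_2\omega_1}^2F_{\omega_1} + \frac{1 - p}{p^2}\sum_{\omega_1 \ne \omega_2}\xi_{\omega_2}E_{\omega_1}P_{\omega_2\omega_1}^2F_{\omega_1}.$$

We use the decoupling argument once more so that for the first term of the right-hand side, it
suffices to estimate the tail of the norm of

$$S_1 \equiv p^{-1}\sum_{\omega_1}\xi^{(1)}_{\omega_1}E_{\omega_1}H_{\omega_1}F_{\omega_1}, \qquad H_{\omega_1} \equiv p^{-2}\sum_{\omega_2:\omega_2 \ne \omega_1}\xi^{(2)}_{\omega_2}P_{\omega_2\omega_1}^2,$$

where $\{\xi^{(1)}_\omega\}$ and $\{\xi^{(2)}_\omega\}$ are independent copies of $\{\xi_\omega\}$. It follows from Bernstein's inequality and
the estimates

$$|P_{\omega_2\omega_1}| \le 2\mu_0 r/n$$

and

$$\sum_{\omega_2:\omega_2 \ne \omega_1}|P_{\omega_2\omega_1}|^4 \le \max_{\omega_2:\omega_2 \ne \omega_1}|P_{\omega_2\omega_1}|^2\sum_{\omega_2:\omega_2 \ne \omega_1}|P_{\omega_2\omega_1}|^2 \le \Big(\frac{2\mu_0 r}{n}\Big)^2\,\frac{2\mu_0 r}{n}$$

that for each $\lambda > 0$ (see footnote 4),

$$\mathbb{P}\Big(|H_{\omega_1}| > \lambda\Big(\frac{2\mu_0 r}{np}\Big)^{3/2}\Big) \le 2\exp\Big(-\frac{\lambda^2}{2 + \frac{2}{3}\lambda\big(\frac{2\mu_0 r}{np}\big)^{1/2}}\Big).$$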

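It is now not hard to see that this inequality implies that

$$\mathbb{P}\Big(\|H\|_\infty > \sqrt{8\beta\log n}\,\Big(\frac{2\mu_0 r}{np}\Big)^{3/2}\Big) \le 2n^{-2\beta + 2}$$

provided that $m \ge \frac{16}{9}\mu_0 nr\beta\log n$. As a consequence, for each $\beta > 2$, Theorem 6.3 gives

$$\|S_1\| \le C\mu_0^{3/2}\mu_1\,\beta\log n\,\Big(\frac{nr}{m}\Big)^2$$

with probability at least $1 - 3n^{-\beta}$. The other term is equal to $(1 - p)$ times $\sum_{\omega_1}E_{\omega_1}H_{\omega_1}F_{\omega_1}$, and

$$\Big\|\sum_{\omega_1}E_{\omega_1}H_{\omega_1}F_{\omega_1}\Big\| \le \Big\|\sum_{\omega_1}E_{\omega_1}H_{\omega_1}F_{\omega_1}\Big\|_F \le \|H\|_\infty\|E\|_F \le C\sqrt{\beta\log n}\,\Big(\frac{\mu_0 nr}{m}\Big)^{3/2}\sqrt{r}.$$

In conclusion, the third term in (6.20) has spectral norm bounded by

$$C\mu_0\sqrt{\beta\log n}\,\Big(\frac{nr}{m}\Big)^{3/2}\Big(\mu_1\sqrt{\frac{\mu_0 nr\beta\log n}{m}} + \sqrt{\mu_0 r}\Big)$$

with probability at least $1 - O(n^{-\beta})$.

Footnote 4. We would like to remark that one can often get better estimates; when $\omega_1 \ne \omega_2$, the bound $|P_{\omega_2\omega_1}| \le 2\mu_0 r/n$
may be rather crude. Indeed, one can derive better estimates for the random orthogonal model, for example.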

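We proceed to the fourth term which can be written as

$$\frac{1 - 2p}{p^3}\sum_{\omega_1 \ne \omega_3}\xi_{\omega_1}\xi_{\omega_3}E_{\omega_3}P_{\omega_3\omega_1}P_{\omega_1\omega_1}F_{\omega_1} + \frac{1 - p}{p^2}\sum_{\omega_1 \ne \omega_3}\xi_{\omega_3}E_{\omega_3}P_{\omega_3\omega_1}P_{\omega_1\omega_1}F_{\omega_1}.$$

Let $S_1$ be the first term and set $H = p^{-2}\sum_{\omega_1 \ne \omega_3}\xi_{\omega_1}\xi_{\omega_3}E_{\omega_3}P_{\omega_3\omega_1}F_{\omega_1}$. Then Lemma 6.4 gives

$$\|S_1\| \le \frac{2\mu_0 r}{np}\,\|H\| \le C\mu_0^{3/2}\mu_1(\beta\log n)\Big(\frac{nr}{m}\Big)^2,$$

where the last inequality is given by Lemma 6.7. For the other term, call it $S_2$, set
$H_{\omega_1} = p^{-1}\sum_{\omega_3:\omega_3 \ne \omega_1}\xi_{\omega_3}E_{\omega_3}P_{\omega_3\omega_1}$. Then Lemma 6.4 gives

$$\|S_2\| \le \frac{2\mu_0 r}{np}\,\|H\|.$$

Notice that $H_{\omega_1} = p^{-1}\sum_{\omega_3}\xi_{\omega_3}E_{\omega_3}P_{\omega_3\omega_1} - p^{-1}\xi_{\omega_1}E_{\omega_1}P_{\omega_1\omega_1}$ so that with $G_{\omega_1} = E_{\omega_1}P_{\omega_1\omega_1}$

$$H = p^{-1}\big[\mathcal{P}_T(\mathcal{P}_\Omega - p\mathcal{I})(E) - (\mathcal{P}_\Omega - p\mathcal{I})(G)\big].$$

Now for any matrix $X$, $\|\mathcal{P}_T(X)\| = \|X - \mathcal{P}_{T^\perp}(X)\| \le 2\|X\|$ and, therefore,

$$\|H\| \le 2p^{-1}\|(\mathcal{P}_\Omega - p\mathcal{I})(E)\| + p^{-1}\|(\mathcal{P}_\Omega - p\mathcal{I})(G)\|.$$

As a consequence and since $\|G\|_\infty \le \|E\|_\infty$, Theorem 6.3 gives for each $\beta > 2$,

$$\|H\| \le C\mu_1\sqrt{\frac{nr\beta\log n}{m}}$$

with probability at least $1 - n^{-\beta}$. In conclusion, the fourth term in (6.20) has spectral norm
bounded by

$$C\mu_0\mu_1\sqrt{\beta\log n}\,\Big(\frac{nr}{m}\Big)^{3/2}\Big(\sqrt{\frac{\mu_0 nr\beta\log n}{m}} + 1\Big)$$

with probability at least $1 - O(n^{-\beta})$.

We finally examine the last term

$$p^{-3}\sum_{\omega_1 \ne \omega_2 \ne \omega_3}\xi_{\omega_1}\xi_{\omega_2}\xi_{\omega_3}E_{\omega_3}P_{\omega_3\omega_2}P_{\omega_2\omega_1}F_{\omega_1}.$$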
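Now just as one has a decoupling inequality for pairs of variables, we have a decoupling inequality
for triples as well and we thus simply need to bound the tail of

$$S_1 \equiv p^{-3}\sum_{\omega_1 \ne \omega_2 \ne \omega_3}\xi^{(1)}_{\omega_1}\xi^{(2)}_{\omega_2}\xi^{(3)}_{\omega_3}E_{\omega_3}P_{\omega_3\omega_2}P_{\omega_2\omega_1}F_{\omega_1}$$

in which the sequences $\{\xi^{(1)}_\omega\}$, $\{\xi^{(2)}_\omega\}$ and $\{\xi^{(3)}_\omega\}$ are independent copies of $\{\xi_\omega\}$. We refer to [16]
for details. We now argue as in Section 6.2 and write $S_1$ as

$$S_1 = p^{-1}\sum_{\omega_1}\xi^{(1)}_{\omega_1}H_{\omega_1}F_{\omega_1},$$

where

$$H_{\omega_1} \equiv p^{-1}\sum_{\omega_2:\omega_2 \ne \omega_1}\xi^{(2)}_{\omega_2}G_{\omega_2}P_{\omega_2\omega_1}, \qquad G_{\omega_2} \equiv p^{-1}\sum_{\omega_3:\omega_3 \ne \omega_1,\,\omega_3 \ne \omega_2}\xi^{(3)}_{\omega_3}E_{\omega_3}P_{\omega_3\omega_2}. \tag{6.23}$$

By Lemma 6.6, we have for each $\beta > 2$

$$\|G\|_\infty \le C\sqrt{\frac{\mu_0 nr\beta\log n}{m}}\,\|E\|_\infty$$

with large probability and the same argument then gives

$$\|H\|_\infty \le C\sqrt{\frac{\mu_0 nr\beta\log n}{m}}\,\|G\|_\infty \le C\,\frac{\mu_0 nr\beta\log n}{m}\,\|E\|_\infty$$

with probability at least $1 - 4n^{-\beta}$. As a consequence, Theorem 6.3 gives

$$\|S\| \le C\mu_0\mu_1\Big(\frac{nr\beta\log n}{m}\Big)^{3/2}$$

with probability at least $1 - O(n^{-\beta})$.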

To summarize the calculations of this section and using the fact that $\mu_0 \ge 1$ and $\mu_1 \le \mu_0\sqrt{r}$,
we have established that if $m \ge \mu_0 nr(\beta\log n)$,

d-iw.-^w'wii < c(=) 2 

\m/ \ m J 

with probability at least $1 - O(n^{-\beta})$. One can check that if $m = \lambda\,\mu_0^{4/3}nr^{4/3}\beta\log n$ for a fixed $\beta \ge 2$
and $\lambda \ge 1$, then there is a constant $C$ such that

$$\|p^{-1}(\mathcal{P}_\Omega - p\mathcal{I})\mathcal{H}^2(E)\| \le C\lambda^{-3/2}$$

with probability at least $1 - O(n^{-\beta})$. This is the content of Lemma 4.6.

6.4 Proof of Lemma 4.7

Clearly, one could continue on the same path and estimate the spectral norm of $p^{-1}(\mathcal{P}_\Omega - p\mathcal{I})\mathcal{H}^3(E)$
by the same technique as in the previous sections. That is to say, we would write

$$p^{-1}(\mathcal{P}_\Omega - p\mathcal{I})\mathcal{H}^3(E) = p^{-4}\sum_{\omega_1,\omega_2,\omega_3,\omega_4}\Big[\prod_{i=1}^4\xi_{\omega_i}\Big]E_{\omega_4}P_{\omega_4\omega_3}P_{\omega_3\omega_2}P_{\omega_2\omega_1}F_{\omega_1}$$

with the same notations as before, and partition the sum depending on whether some of the $\omega_i$'s
are the same or not. Then we would use the decoupling argument to bound each term in the sum.
Although this is a clear possibility, one would need to consider 18 cases and the calculations would
become a little laborious. In this section, we propose to bound the term $p^{-1}(\mathcal{P}_\Omega - p\mathcal{I})\mathcal{H}^3(E)$ with
a different argument which has two main advantages: first, it is much shorter and second, it uses
much of what we have already established. The downside is that it is not as sharp.

The starting point is to note that

$$p^{-1}(\mathcal{P}_\Omega - p\mathcal{I})\mathcal{H}^3(E) = p^{-1}\big(\Xi \circ \mathcal{H}^3(E)\big),$$

where $\Xi$ is the matrix with i.i.d. entries equal to $\xi_{ab} = \delta_{ab} - p$ and $\circ$ denotes the Hadamard product
(componentwise multiplication). To bound the spectral norm of this Hadamard product, we apply
an inequality due to Ando, Horn, and Johnson [4]. An elementary proof can be found in §5.6 of
[19].
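The rewriting of $(\mathcal{P}_\Omega - p\mathcal{I})(X)$ as a Hadamard product with the matrix of centered Bernoulli variables is purely entrywise, and can be illustrated on a small random example. In the sketch below the matrix size, the sampling probability, the seed, and the helper names are hypothetical choices made for the illustration.

```python
import random

def sampling_residual(X, delta, p):
    """Entries of (P_Omega - p I)(X): keep X_ab where delta_ab = 1, subtract p*X_ab."""
    rows, cols = len(X), len(X[0])
    return [[delta[a][b] * X[a][b] - p * X[a][b] for b in range(cols)]
            for a in range(rows)]

def hadamard(A, B):
    """Componentwise (Hadamard) product of two matrices of equal shape."""
    return [[A[a][b] * B[a][b] for b in range(len(A[0]))]
            for a in range(len(A))]

rng = random.Random(0)
n, p = 5, 0.3
X = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(n)]
delta = [[1 if rng.random() < p else 0 for _ in range(n)] for _ in range(n)]
Xi = [[delta[a][b] - p for b in range(n)] for a in range(n)]   # entries delta_ab - p

lhs = sampling_residual(X, delta, p)
rhs = hadamard(Xi, X)
max_gap = max(abs(lhs[a][b] - rhs[a][b]) for a in range(n) for b in range(n))
assert max_gap < 1e-12
```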



Lemma 6.9 [19] Let $A$ and $B$ be two $n_1 \times n_2$ matrices. Then

$$\|A \circ B\| \le \|A\|\,\nu(B), \tag{6.24}$$

where $\nu$ is the function

$$\nu(B) = \inf\{c(X)\,c(Y) : XY^* = B\},$$

and $c(X)$ is the maximum Euclidean norm of the rows

$$c(X)^2 = \max_{1 \le i \le n}\sum_j X_{ij}^2.$$



To apply (6.24), we first notice that one can estimate the norm of $\Xi$ via Theorem 6.3. Indeed,
let $Z = \mathbf{1}\mathbf{1}^*$ be the matrix with all entries equal to one. Then $p^{-1}\Xi = p^{-1}(\mathcal{P}_\Omega - p\mathcal{I})(Z)$ and thus

$$p^{-1}\|\Xi\| \le C\Big(\frac{n^3\beta\log n}{m}\Big)^{1/2} \tag{6.25}$$

with probability at least $1 - n^{-\beta}$. One could obtain a similar result by appealing to the recent
literature on random matrix theory and on concentration of measure. Potentially this could allow
one to derive an upper bound without the logarithmic term but we will not consider these refinements
here. (It is interesting to note in passing, however, that the two page proof of Theorem 6.3 gives a
large deviation result about the largest singular value of a matrix with i.i.d. entries which is sharp
up to a multiplicative factor proportional to at most $\sqrt{\log n}$.)



Second, we bound the second factor in (6.24) via the following estimate:

Lemma 6.10 There are numerical constants $C$ and $c$ so that for each $\beta > 2$, $\mathcal{H}^3(E)$ obeys

$$\nu(\mathcal{H}^3(E)) \le C\mu_0 r/n \tag{6.26}$$

with probability at least $1 - O(n^{-\beta})$ provided that $m \ge c\,\mu_0^{4/3}nr^{5/3}(\beta\log n)$.

The two inequalities (6.25) and (6.26) give

$$p^{-1}\|\Xi \circ \mathcal{H}^3(E)\| \le C\sqrt{\frac{\mu_0^2\,nr^2\beta\log n}{m}}$$

with large probability. Hence, when $m$ is substantially larger than a constant times $\mu_0^2\,nr^2(\beta\log n)$,
we have that the spectral norm of $p^{-1}(\mathcal{P}_\Omega - p\mathcal{I})\mathcal{H}^3(E)$ is much less than 1. This is the content of
Lemma 4.7.

The remainder of this section proves Lemma 6.10. Set $S \equiv \mathcal{H}^3(E)$ for short. Because $S$ is in
$T$, $S = \mathcal{P}_T(S) = P_US + SP_V - P_USP_V$. Writing $P_U = \sum_{j=1}^r u_ju_j^*$ and similarly for $P_V$ gives

$$S = \sum_{j=1}^r u_j(u_j^*S) + \sum_{j=1}^r\big((I - P_U)Sv_j\big)v_j^*.$$

For each $1 \le j \le r$, let $\alpha_j \equiv Sv_j$ and $\beta_j^* \equiv u_j^*S$. Then the decomposition

$$S = \sum_{j=1}^r u_j\beta_j^* + \sum_{j=1}^r(P_{U^\perp}\alpha_j)v_j^*,$$

where $P_{U^\perp} = I - P_U$, provides a factorization of the form

$$S = XY^*, \qquad X = [u_1, \ldots, u_r, P_{U^\perp}\alpha_1, \ldots, P_{U^\perp}\alpha_r], \qquad Y = [\beta_1, \ldots, \beta_r, v_1, \ldots, v_r].$$

It follows from our assumption that

$$c([u_1, \ldots, u_r])^2 = \max_{1 \le i \le n}\sum_{1 \le j \le r}u_{ij}^2 = \max_{1 \le i \le n}\|P_Ue_i\|^2 \le \mu_0 r/n,$$

and similarly for $[v_1, \ldots, v_r]$. Hence, to prove Lemma 6.10, it suffices to prove that the maximum
row norm obeys $c([\beta_1, \ldots, \beta_r]) \le C\sqrt{\mu_0 r/n}$ for some constant $C > 0$, and similarly for the matrix
$[P_{U^\perp}\alpha_1, \ldots, P_{U^\perp}\alpha_r]$.



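Lemma 6.11 There is a numerical constant $C$ such that for each $\beta > 2$,

$$c([\alpha_1, \ldots, \alpha_r]) \le C\sqrt{\mu_0 r/n} \tag{6.27}$$

with probability at least $1 - O(n^{-\beta})$ provided that $m$ obeys the condition of Lemma 6.10.

A similar estimate for $[\beta_1, \ldots, \beta_r]$ is obtained in the same way by exchanging the roles of $u$ and $v$.
Moreover, a minor modification of the argument gives

$$c([P_{U^\perp}\alpha_1, \ldots, P_{U^\perp}\alpha_r]) \le C\sqrt{\mu_0 r/n} \tag{6.28}$$

as well, and we will omit the details. In short, the estimate (6.27) implies Lemma 6.10.

Proof [of Lemma 6.11] To prove (6.27), we use the notations of the previous section and write

$$\alpha_j = p^{-3}\sum_{a_1b_1,\,a_2b_2,\,a_3b_3}\xi_{a_1b_1}\xi_{a_2b_2}\xi_{a_3b_3}\,E_{a_3b_3}\,\langle\mathcal{P}_T(e_{a_3}e_{b_3}^*), e_{a_2}e_{b_2}^*\rangle\,\langle\mathcal{P}_T(e_{a_2}e_{b_2}^*), e_{a_1}e_{b_1}^*\rangle\,\mathcal{P}_T(e_{a_1}e_{b_1}^*)\,v_j$$

$$\phantom{\alpha_j} = p^{-3}\sum_{\omega_1,\omega_2,\omega_3}\xi_{\omega_1}\xi_{\omega_2}\xi_{\omega_3}\,E_{\omega_3}P_{\omega_3\omega_2}P_{\omega_2\omega_1}\,(F_{\omega_1}v_j)$$

since for any matrix $X$, $\mathcal{P}_T(X)v_j = Xv_j$ for each $1 \le j \le r$. We then follow the same steps as in
Section 6.3 and partition the sum depending on whether some of the $\omega_i$'s are the same or not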



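$$\alpha_j = p^{-3}\Big[\sum_{\omega_1 = \omega_2 = \omega_3} + \sum_{\omega_1 \ne \omega_2 = \omega_3} + \sum_{\omega_1 = \omega_3 \ne \omega_2} + \sum_{\omega_1 = \omega_2 \ne \omega_3} + \sum_{\omega_1 \ne \omega_2 \ne \omega_3}\Big]. \tag{6.29}$$

The idea is this: to establish (6.27), it is sufficient to show that if $\gamma_j$ is any of the five terms above,
it obeys

$$\sqrt{\sum_{1 \le j \le r}|\gamma_{ij}|^2} \le C\sqrt{\mu_0 r/n} \tag{6.30}$$

($\gamma_{ij}$ is the $i$th component of $\gamma_j$ as usual) with large probability. The strategy for getting such
estimates is to use decoupling whenever applicable.

Just as Theorem 6.3 proved useful to bound the norm of $p^{-1}(\mathcal{P}_\Omega - p\mathcal{I})\mathcal{H}^2(E)$ in Section 6.3,
the lemma below will help bounding the magnitudes of the components of $\alpha_j$.

Lemma 6.12 Define $S \equiv p^{-1}\sum_{ij}\sum_\omega\xi_\omega H_\omega\langle e_i, F_\omega v_j\rangle\,e_ie_j^*$. Then

$$\mathbb{P}\big(\|S\|_\infty \ge \sqrt{\mu_0/n}\big) \le 2n^2\exp\Big(-\frac{1}{\frac{2n}{\mu_0 p}\|H\|_\infty^2 + \frac{2}{3p}\sqrt{r}\,\|H\|_\infty}\Big). \tag{6.31}$$

Proof The proof is an application of Bernstein's inequality (6.16). Note that $\langle e_i, F_\omega v_j\rangle = 1_{\{a = i\}}v_{bj}$
and hence

$$\mathrm{Var}(S_{ij}) \le p^{-1}\|H\|_\infty^2\sum_\omega|\langle e_i, F_\omega v_j\rangle|^2 = p^{-1}\|H\|_\infty^2$$

since $\sum_\omega|\langle e_i, F_\omega v_j\rangle|^2 = 1$, and $|p^{-1}H_\omega\langle e_i, F_\omega v_j\rangle| \le p^{-1}\|H\|_\infty\sqrt{\mu_0 r/n}$ since $|\langle e_i, F_\omega v_j\rangle| \le |v_{bj}|$
and

$$|v_{bj}| \le \|P_Ve_b\| \le \sqrt{\mu_0 r/n}. \qquad ■$$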

Each term in (6.29) is given by the corresponding term in (6.20) after formally substituting $F_\omega$
with $F_\omega v_j$. We begin with the first term whose $i$th component is equal to

$$\gamma_{ij} \equiv p^{-3}(1 - 3p + 3p^2)\sum_\omega\xi_\omega E_\omega P_{\omega\omega}^2\langle e_i, F_\omega v_j\rangle + p^{-2}(1 - 3p + 2p^2)\sum_\omega E_\omega P_{\omega\omega}^2\langle e_i, F_\omega v_j\rangle. \tag{6.32}$$

Ignoring the constant factor $(1 - 3p + 3p^2)$ which is bounded by 1, we write the first of these two
terms as

$$(S_0)_{ij} \equiv p^{-1}\sum_\omega\xi_\omega H_\omega\langle e_i, F_\omega v_j\rangle, \qquad H_\omega = E_\omega(p^{-1}P_{\omega\omega})^2.$$

Since $\|H\|_\infty \le (\mu_0 nr/m)^2\mu_1\sqrt{r}/n$, it follows from Lemma 6.12 that

$$\mathbb{P}\big(\|S_0\|_\infty \ge \sqrt{\mu_0/n}\big) \le 2n^2e^{-1/D}, \qquad D \le C\Big(\mu_0^3\mu_1^2\Big(\frac{nr}{m}\Big)^5 + \mu_0^2\mu_1\Big(\frac{nr}{m}\Big)^3\Big)$$

for some numerical $C > 0$. Since $\mu_1 \le \mu_0\sqrt{r}$, we have that when $m \ge \lambda\mu_0\,nr^{6/5}(\beta\log n)$ for some
numerical constant $\lambda > 0$, $\|S_0\|_\infty \ge \sqrt{\mu_0/n}$ with probability at most $2n^2e^{-(\beta\log n)^2}$; this probability
is inversely proportional to a superpolynomial in $n$. For the second term, the matrix with entries
$E_\omega P_{\omega\omega}^2$ is given by

$$\Lambda_U^2E + E\Lambda_V^2 + 2\Lambda_UE\Lambda_V + \Lambda_U^2E\Lambda_V^2 - 2\Lambda_U^2E\Lambda_V - 2\Lambda_UE\Lambda_V^2$$

and thus

$$\sum_\omega E_\omega P_{\omega\omega}^2\langle e_i, F_\omega v_j\rangle = \langle e_i, (\Lambda_U^2E + E\Lambda_V^2 + 2\Lambda_UE\Lambda_V + \Lambda_U^2E\Lambda_V^2 - 2\Lambda_U^2E\Lambda_V - 2\Lambda_UE\Lambda_V^2)v_j\rangle.$$

This is a sum of six terms and we will show how to bound the first three; the last three are dealt with
in exactly the same way and obey better estimates. For the first, we have

$$\langle e_i, \Lambda_U^2Ev_j\rangle = \langle\Lambda_U^2e_i, Ev_j\rangle = \|P_Ue_i\|^4\langle e_i, u_j\rangle.$$
Hence

$$p^{-2}\sqrt{\sum_{1 \le j \le r}|\langle e_i, \Lambda_U^2Ev_j\rangle|^2} = p^{-2}\|P_Ue_i\|^4\sqrt{\sum_{1 \le j \le r}|\langle e_i, u_j\rangle|^2} = p^{-2}\|P_Ue_i\|^5 \le \Big(\frac{\mu_0 r}{np}\Big)^2\sqrt{\frac{\mu_0 r}{n}}.$$

In other words, when $m \ge \mu_0 nr$, the right-hand side is bounded by $\sqrt{\mu_0 r/n}$ as desired. For the
second term, we have

$$\langle e_i, E\Lambda_V^2v_j\rangle = \sum_b\|P_Ve_b\|^4v_{bj}\langle e_i, Ee_b\rangle = \sum_b\|P_Ve_b\|^4v_{bj}E_{ib}.$$

Hence it follows from the Cauchy-Schwarz inequality and (6.4) that

$$p^{-2}|\langle e_i, E\Lambda_V^2v_j\rangle| \le \Big(\frac{\mu_0 r}{np}\Big)^2\sqrt{\frac{\mu_0 r}{n}}.$$

In other words, when $m \ge \mu_0 nr^{5/4}$,

$$p^{-2}\sqrt{\sum_{1 \le j \le r}|\langle e_i, E\Lambda_V^2v_j\rangle|^2} \le \sqrt{\frac{\mu_0 r}{n}} \tag{6.33}$$

as desired. For the third term, we have

$$\langle e_i, \Lambda_UE\Lambda_Vv_j\rangle = \|P_Ue_i\|^2\sum_b\|P_Ve_b\|^2v_{bj}E_{ib}.$$

The Cauchy-Schwarz inequality gives

$$2p^{-2}|\langle e_i, \Lambda_UE\Lambda_Vv_j\rangle| \le 2\Big(\frac{\mu_0 r}{np}\Big)^2\sqrt{\frac{\mu_0 r}{n}}$$

just as before. In other words, when $m \ge \mu_0 nr^{5/4}$, $2p^{-2}\sqrt{\sum_{1 \le j \le r}|\langle e_i, \Lambda_UE\Lambda_Vv_j\rangle|^2}$ is bounded by
$2\sqrt{\mu_0 r/n}$. The other terms obey (6.33) as well when $m \ge \mu_0 nr^{5/4}$. In conclusion, the first term
(6.32) in (6.29) obeys (6.30) with probability at least $1 - O(n^{-\beta})$ provided that $m \ge \mu_0 nr^{5/4}(\beta\log n)$.
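We now turn our attention to the second term which can be written as

$$p^{-3}(1 - 2p)\sum_{\omega_1 \ne \omega_2}\xi_{\omega_1}\xi_{\omega_2}E_{\omega_2}P_{\omega_2\omega_2}P_{\omega_2\omega_1}\langle e_i, F_{\omega_1}v_j\rangle + p^{-2}(1 - p)\sum_{\omega_1 \ne \omega_2}\xi_{\omega_1}E_{\omega_2}P_{\omega_2\omega_2}P_{\omega_2\omega_1}\langle e_i, F_{\omega_1}v_j\rangle.$$

We decouple the first term so that it suffices to bound

$$(S_0)_{ij} \equiv p^{-1}\sum_{\omega_1}\xi^{(1)}_{\omega_1}H_{\omega_1}\langle e_i, F_{\omega_1}v_j\rangle, \qquad H_{\omega_1} \equiv p^{-2}\sum_{\omega_2:\omega_2 \ne \omega_1}\xi^{(2)}_{\omega_2}E_{\omega_2}P_{\omega_2\omega_2}P_{\omega_2\omega_1},$$

where the sequences $\{\xi^{(1)}_\omega\}$ and $\{\xi^{(2)}_\omega\}$ are independent. The method from Section 6.2 shows that

$$\|H\|_\infty \le C\sqrt{\frac{\mu_0 nr\beta\log n}{m}}\,\sup_\omega|E_\omega(p^{-1}P_{\omega\omega})| \le C\sqrt{\beta\log n}\,\Big(\frac{\mu_0 nr}{m}\Big)^{3/2}\|E\|_\infty$$

with probability at least $1 - 2n^{-\beta}$ for each $\beta > 2$. Therefore, Lemma 6.12 gives

$$\mathbb{P}\big(\|S_0\|_\infty \ge \sqrt{\mu_0/n}\big) \le 2n^2e^{-1/D}, \tag{6.34}$$

where $D$ obeys

$$D \le C\Big(\mu_0^2\mu_1^2(\beta\log n)\Big(\frac{nr}{m}\Big)^4 + \mu_0^{3/2}\mu_1\sqrt{\beta\log n}\,\Big(\frac{nr}{m}\Big)^{5/2}\Big) \tag{6.35}$$

for some positive constant $C$. Hence, when $m \ge \lambda\mu_0\,nr^{5/4}(\beta\log n)$ for some sufficiently large
numerical constant $\lambda > 0$, we have that $\|S_0\|_\infty \ge \sqrt{\mu_0/n}$ with probability at most $2n^2e^{-(\beta\log n)^2}$.
This is inversely proportional to a superpolynomial in $n$. We write the second term as

$$(S_1)_{ij} \equiv p^{-1}\sum_{\omega_1}\xi_{\omega_1}H_{\omega_1}\langle e_i, F_{\omega_1}v_j\rangle, \qquad H_{\omega_1} \equiv p^{-1}\sum_{\omega_2:\omega_2 \ne \omega_1}E_{\omega_2}P_{\omega_2\omega_2}P_{\omega_2\omega_1}.$$

We know from Section 6.3 that $H$ obeys $\|H\|_\infty \le C\mu_0^2r^2/m$ since $\mu_1 \le \mu_0\sqrt{r}$ so that Lemma 6.12
gives

$$\mathbb{P}\big(\|S_1\|_\infty \ge \sqrt{\mu_0/n}\big) \le 2n^2e^{-1/D}, \qquad D \le C\Big(\mu_0^3\,\frac{n^3r^4}{m^3} + \mu_0^2\,\frac{n^2r^{5/2}}{m^2}\Big)$$

for some $C > 0$. Hence, when $m \ge \lambda\mu_0\,nr^{4/3}(\beta\log n)$ for some numerical constant $\lambda > 0$, we have
that $\|S_1\|_\infty \ge \sqrt{\mu_0/n}$ with probability at most $2n^2e^{-(\beta\log n)^2}$. This is inversely proportional to a
superpolynomial in $n$. In conclusion and taking into account the decoupling constants in (6.12),
the second term in (6.29) obeys (6.30) with probability at least $1 - O(n^{-\beta})$ provided that $m$ is
sufficiently large as above.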

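We now examine the third term which can be written as

$$p^{-3}(1 - 2p)\sum_{\omega_1 \ne \omega_2}\xi_{\omega_1}\xi_{\omega_2}E_{\omega_1}P_{\omega_2\omega_1}^2\langle e_i, F_{\omega_1}v_j\rangle + p^{-2}(1 - p)\sum_{\omega_1 \ne \omega_2}\xi_{\omega_2}E_{\omega_1}P_{\omega_2\omega_1}^2\langle e_i, F_{\omega_1}v_j\rangle.$$

For the first term of the right-hand side, it suffices to estimate the tail of

$$(S_0)_{ij} \equiv p^{-1}\sum_{\omega_1}\xi^{(1)}_{\omega_1}E_{\omega_1}H_{\omega_1}\langle e_i, F_{\omega_1}v_j\rangle, \qquad H_{\omega_1} \equiv p^{-2}\sum_{\omega_2:\omega_2 \ne \omega_1}\xi^{(2)}_{\omega_2}P_{\omega_2\omega_1}^2,$$

where $\{\xi^{(1)}_\omega\}$ and $\{\xi^{(2)}_\omega\}$ are independent. We know from Section 6.3 that $\|H\|_\infty$ obeys $\|H\|_\infty \le
C\sqrt{\beta\log n}\,(\mu_0 nr/m)^{3/2}$ with probability at least $1 - 2n^{-\beta}$ for each $\beta > 2$. Thus, Lemma 6.12
shows that $S_0$ obeys (6.34)-(6.35) just as before. The other term is equal to $(1 - p)$ times
$\sum_{\omega_1}E_{\omega_1}H_{\omega_1}\langle e_i, F_{\omega_1}v_j\rangle$, and by the Cauchy-Schwarz inequality and (6.4)

$$\Big|\sum_{\omega_1}E_{\omega_1}H_{\omega_1}\langle e_i, F_{\omega_1}v_j\rangle\Big| \le \|H\|_\infty\Big(\frac{\mu_0 r}{n}\Big)^{1/2} \le C\sqrt{\frac{\mu_0}{n}}\,\Big(\frac{\mu_0\,nr^{4/3}\beta\log n}{m}\Big)^{3/2}$$

on the event where $\|H\|_\infty \le C\sqrt{\beta\log n}\,(\mu_0 nr/m)^{3/2}$. Hence, when $m \ge \lambda\mu_0\,nr^{4/3}(\beta\log n)$ for
some numerical constant $\lambda > 0$, we have that $|\sum_{\omega_1}E_{\omega_1}H_{\omega_1}\langle e_i, F_{\omega_1}v_j\rangle| \le \sqrt{\mu_0/n}$ on this event. In
conclusion, the third term in (6.29) obeys (6.30) with probability at least $1 - O(n^{-\beta})$ provided that
$m$ is sufficiently large as above.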

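We proceed to the fourth term which can be written as

$$p^{-3}(1 - 2p)\sum_{\omega_1 \ne \omega_3}\xi_{\omega_1}\xi_{\omega_3}E_{\omega_3}P_{\omega_3\omega_1}P_{\omega_1\omega_1}\langle e_i, F_{\omega_1}v_j\rangle + p^{-2}(1 - p)\sum_{\omega_1 \ne \omega_3}\xi_{\omega_3}E_{\omega_3}P_{\omega_3\omega_1}P_{\omega_1\omega_1}\langle e_i, F_{\omega_1}v_j\rangle.$$

We use the decoupling trick for the first term and bound the tail of

$$(S_0)_{ij} \equiv p^{-1}\sum_{\omega_1}\xi^{(1)}_{\omega_1}H_{\omega_1}(p^{-1}P_{\omega_1\omega_1})\langle e_i, F_{\omega_1}v_j\rangle, \qquad H_{\omega_1} \equiv p^{-1}\sum_{\omega_3:\omega_3 \ne \omega_1}\xi^{(3)}_{\omega_3}E_{\omega_3}P_{\omega_3\omega_1},$$

where $\{\xi^{(1)}_\omega\}$ and $\{\xi^{(3)}_\omega\}$ are independent. We know from Section 6.2 that

$$\|H\|_\infty \le C\sqrt{\frac{\mu_0 nr\beta\log n}{m}}\,\|E\|_\infty$$

with probability at least $1 - 2n^{-\beta}$ for each $\beta > 2$. Therefore, Lemma 6.12 shows that $S_0$ obeys
(6.34)-(6.35) just as before. The other term is equal to $(1 - p)$ times $\sum_{\omega_1}H_{\omega_1}(p^{-1}P_{\omega_1\omega_1})\langle e_i, F_{\omega_1}v_j\rangle$,
and the Cauchy-Schwarz inequality gives

$$\Big|\sum_{\omega_1}H_{\omega_1}(p^{-1}P_{\omega_1\omega_1})\langle e_i, F_{\omega_1}v_j\rangle\Big| \le \sqrt{n}\,\|H\|_\infty\,\frac{2\mu_0 nr}{m} \le C\sqrt{\beta\log n}\,\Big(\frac{\mu_0 nr}{m}\Big)^{3/2}\mu_1\sqrt{\frac{r}{n}}$$

on the event $\|H\|_\infty \le C\sqrt{\mu_0 nr(\beta\log n)/m}\,\|E\|_\infty$. Because $\mu_1 \le \mu_0\sqrt{r}$, we have that whenever
$m \ge \lambda\,\mu_0^{4/3}nr^{5/3}(\beta\log n)$ for some numerical constant $\lambda > 0$, $p^{-1}|\sum_{\omega_1}H_{\omega_1}P_{\omega_1\omega_1}\langle e_i, F_{\omega_1}v_j\rangle| \le
\sqrt{\mu_0/n}$ just as before. In conclusion, the fourth term in (6.29) obeys (6.30) with probability at
least $1 - O(n^{-\beta})$ provided that $m$ is sufficiently large as above.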
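We finally examine the last term

$$p^{-3}\sum_{\omega_1 \ne \omega_2 \ne \omega_3}\xi_{\omega_1}\xi_{\omega_2}\xi_{\omega_3}E_{\omega_3}P_{\omega_3\omega_2}P_{\omega_2\omega_1}\langle e_i, F_{\omega_1}v_j\rangle.$$

Just as before, we need to bound the tail of

$$(S_0)_{ij} \equiv p^{-1}\sum_{\omega_1}\xi^{(1)}_{\omega_1}H_{\omega_1}\langle e_i, F_{\omega_1}v_j\rangle,$$

where $H$ is given by (6.23). We know from Section 6.3 that $H$ obeys

$$\|H\|_\infty \le C(\beta\log n)\,\frac{\mu_0 nr}{m}\,\frac{\mu_1\sqrt{r}}{n}$$

with probability at least $1 - 4n^{-\beta}$ for each $\beta > 2$. Therefore, Lemma 6.12 gives

$$\mathbb{P}\big(\|S_0\|_\infty \ge \sqrt{\mu_0/n}\big) \le 2n^2e^{-1/D}, \qquad D \le C\Big(\mu_0\mu_1^2(\beta\log n)^2\Big(\frac{nr}{m}\Big)^3 + \mu_0\mu_1(\beta\log n)\Big(\frac{nr}{m}\Big)^2\Big)$$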



for some C > 0. Hence, when m > A//o nr 4//3 (/31og n) for some numerical constant A > 0, we have 
that HjSqHoo > ^y /fio/n with probability at most 2n 2 e _ ^ logri ). In conclusion, the fifth term in 
(6.29) obeys (6.30) with probability at least 1 — 0(n~P) provided that m is sufficiently large as 



above. 

To summarize the calculations of this section, if $m = \lambda\,\mu_0^{4/3}nr^{5/3}(\beta\log n)$ where $\beta \ge 2$ is fixed
and $\lambda$ is some sufficiently large numerical constant, then

$$\sum_{1 \le j \le r}|\alpha_{ij}|^2 \le C\mu_0 r/n$$

with probability at least $1 - O(n^{-\beta})$. This concludes the proof. ■



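6.5 Proof of Lemma 4.8

It remains to study the spectral norm of $p^{-1}(\mathcal{P}_{T^\perp}\mathcal{P}_\Omega\mathcal{P}_T)\sum_{k \ge k_0}\mathcal{H}^k(E)$ for some positive integer
$k_0$, which we bound by the Frobenius norm

$$p^{-1}\Big\|(\mathcal{P}_{T^\perp}\mathcal{P}_\Omega\mathcal{P}_T)\sum_{k \ge k_0}\mathcal{H}^k(E)\Big\| \le p^{-1}\Big\|(\mathcal{P}_\Omega\mathcal{P}_T)\sum_{k \ge k_0}\mathcal{H}^k(E)\Big\|_F \le \sqrt{3/2p}\,\Big\|\sum_{k \ge k_0}\mathcal{H}^k(E)\Big\|_F,$$

where the last inequality follows from Corollary 4.3. To bound the Frobenius norm of the series, write

$$\Big\|\sum_{k \ge k_0}\mathcal{H}^k(E)\Big\|_F \le \|\mathcal{H}\|^{k_0}\|E\|_F + \|\mathcal{H}\|^{k_0+1}\|E\|_F + \cdots \le \frac{\|\mathcal{H}\|^{k_0}}{1 - \|\mathcal{H}\|}\,\|E\|_F.$$

Theorem 4.1 gives an upper bound on $\|\mathcal{H}\|$ since $\|\mathcal{H}\| \le C_R\sqrt{\mu_0 nr\beta\log n/m} < 1/2$ on an event
with probability at least $1 - 3n^{-\beta}$. Since $\|E\|_F = \sqrt{r}$, we conclude that

$$p^{-1}\Big\|(\mathcal{P}_\Omega\mathcal{P}_T)\sum_{k \ge k_0}\mathcal{H}^k(E)\Big\|_F \le C\,\frac{1}{\sqrt{p}}\,\sqrt{r}\,\Big(\frac{\mu_0 nr\beta\log n}{m}\Big)^{k_0/2} = C\Big(\frac{n^2r}{m}\Big)^{1/2}\Big(\frac{\mu_0 nr\beta\log n}{m}\Big)^{k_0/2}$$

with large probability. This is the content of Lemma 4.8.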



7 Numerical Experiments 

To demonstrate the practical applicability of the nuclear norm heuristic for recovering low-rank
matrices from their entries, we conducted a series of numerical experiments for a variety of
matrix sizes $n$, ranks $r$, and numbers of entries $m$. For each $(n, m, r)$ triple, we repeated the
following procedure 50 times. We generated $M$, an $n \times n$ matrix of rank $r$, by sampling two $n \times r$
factors $M_L$ and $M_R$ with i.i.d. Gaussian entries and setting $M = M_L M_R^*$. We sampled a subset
$\Omega$ of $m$ entries uniformly at random. Then the nuclear norm minimization

$$\begin{array}{ll} \text{minimize} & \|X\|_* \\ \text{subject to} & X_{ij} = M_{ij}, \quad (i,j) \in \Omega \end{array}$$

was solved using the SDP solver SDPT3 [34]. We declared $M$ to be recovered if the solution
returned by the SDP, $X_{\mathrm{opt}}$, satisfied $\|X_{\mathrm{opt}} - M\|_F / \|M\|_F < 10^{-3}$. Figure 1 shows the results
of these experiments for n = 40 and 50. The x-axis corresponds to the fraction of the entries of 
the matrix that are revealed to the SDP solver. The y-axis corresponds to the ratio between the 
dimension of the set of rank r matrices, d r = r(2n — r), and the number of measurements m. Note 
that both of these axes range from zero to one as a value greater than one on the x-axis corresponds 
to an overdetermined linear system where the semidefinite program always succeeds, and a value of 
greater than one on the y-axis corresponds to a situation where there is always an infinite number 
of matrices with rank r with the given entries. The color of each cell in the figures reflects the 
empirical recovery rate of the 50 runs (scaled between 0 and 1). White denotes perfect recovery in
all experiments, and black denotes failure for all experiments. Interestingly, the experiments reveal 
very similar plots for different n, suggesting that our asymptotic conditions for recovery may be 
rather conservative. 
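
The setup of this first experiment can be sketched in a few lines. This is only an illustration of the data generation, the plot coordinates, and the success test described above, not the code used to produce Figure 1; the function names are hypothetical, NumPy stands in for the original environment, and the SDP solve itself (done with SDPT3 in the text) is deliberately left out.

```python
import numpy as np

def make_instance(n, r, m, seed=0):
    """Draw M = M_L @ M_R.T of rank r and a uniformly random set of m revealed entries."""
    rng = np.random.default_rng(seed)
    M_L = rng.standard_normal((n, r))
    M_R = rng.standard_normal((n, r))
    M = M_L @ M_R.T
    # Choose m distinct positions among the n*n entries, uniformly at random.
    flat = rng.choice(n * n, size=m, replace=False)
    mask = np.zeros(n * n, dtype=bool)
    mask[flat] = True
    return M, mask.reshape(n, n)

def plot_coordinates(n, r, m):
    """x = fraction of revealed entries m/n^2; y = d_r/m with d_r = r(2n - r)."""
    d_r = r * (2 * n - r)
    return m / n**2, d_r / m

def is_recovered(X_opt, M, tol=1e-3):
    """Relative Frobenius error test used to declare a successful recovery."""
    return np.linalg.norm(X_opt - M, "fro") / np.linalg.norm(M, "fro") < tol

M, mask = make_instance(n=40, r=5, m=800)
x, y = plot_coordinates(n=40, r=5, m=800)
print(np.linalg.matrix_rank(M), mask.sum(), x, y)
```

Any solver for the nuclear norm program could be placed between `make_instance` and `is_recovered`; repeating the loop 50 times per $(n, m, r)$ triple gives one cell of the plot.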

For a second experiment, we generated random positive semidefinite matrices and tried to 
recover them from their entries using the nuclear norm heuristic. As above, we repeated the same 
procedure 50 times for each $(n, m, r)$ triple. We generated $M$, an $n \times n$ positive semidefinite matrix
of rank $r$, by sampling an $n \times r$ factor $M_F$ with i.i.d. Gaussian entries and setting $M = M_F M_F^*$.
We sampled a subset $\Omega$ of $m$ entries uniformly at random. Then we solved the nuclear norm
minimization problem

$$\begin{array}{ll} \text{minimize} & \operatorname{trace}(X) \\ \text{subject to} & X_{ij} = M_{ij}, \quad (i,j) \in \Omega \\ & X \succeq 0. \end{array}$$

As above, we declared $M$ to be recovered if $\|X_{\mathrm{opt}} - M\|_F / \|M\|_F < 10^{-3}$. Figure 2 shows the
results of these experiments for $n = 40$ and 50. The x-axis again corresponds to the fraction
of the entries of the matrix that are revealed to the SDP solver, but, in this case, the number of
measurements is divided by $D_n = n(n+1)/2$, the number of unique entries in a positive semidefinite
matrix, and the dimension of the rank $r$ matrices is $d_r = nr - r(r-1)/2$. The color of each cell
is chosen in the same fashion as in the experiment with full matrices. Interestingly, the recovery
region is much larger for positive semidefinite matrices, and future work is needed to investigate if
the theoretical scaling is also more favorable in this scenario of low-rank matrix completion.

Figure 1: Recovery of full matrices from their entries. For each $(n, m, r)$ triple, we
repeated the following procedure 50 times. A matrix $M$ of rank $r$ and a subset of $m$ entries were
selected at random. Then we solved the nuclear norm minimization for $X$ subject to $X_{ij} = M_{ij}$
on the selected entries. We declared $M$ to be recovered if $\|X_{\mathrm{opt}} - M\|_F / \|M\|_F < 10^{-3}$. The
results are shown for (a) $n = 40$ and (b) $n = 50$. The color of each cell reflects the empirical
recovery rate (scaled between 0 and 1). White denotes perfect recovery in all experiments, and
black denotes failure for all experiments.
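
In the positive semidefinite experiment the objective is the trace rather than the nuclear norm. The two coincide on positive semidefinite matrices, since the singular values are then the eigenvalues. The following sketch checks this numerically and computes the plot coordinates $m / D_n$ and $d_r / m$; the helper names are hypothetical and the code is an illustration, not the original experiment script.

```python
import numpy as np

def psd_instance(n, r, seed=0):
    """Draw a rank-r positive semidefinite matrix M = M_F @ M_F.T."""
    rng = np.random.default_rng(seed)
    M_F = rng.standard_normal((n, r))
    return M_F @ M_F.T

def psd_plot_coordinates(n, r, m):
    """x = m / D_n with D_n = n(n+1)/2; y = d_r / m with d_r = nr - r(r-1)/2."""
    D_n = n * (n + 1) // 2
    d_r = n * r - r * (r - 1) // 2
    return m / D_n, d_r / m

M = psd_instance(n=40, r=5)
# Nuclear norm = sum of singular values; for a PSD matrix this equals the trace.
nuclear = np.linalg.svd(M, compute_uv=False).sum()
print(np.isclose(nuclear, np.trace(M)))
print(psd_plot_coordinates(n=40, r=5, m=410))
```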

Finally, in Figure 3 we plot the performance of the nuclear norm heuristic when recovering
low-rank matrices from Gaussian projections of these matrices. In these cases, $M$ was generated
in the same fashion as above, but, in place of sampling entries, we generated $m$ random Gaussian
projections of the data (see the discussion in Section 1.4). Then we solved the optimization

$$\begin{array}{ll} \text{minimize} & \|X\|_* \\ \text{subject to} & \mathcal{A}(X) = \mathcal{A}(M) \end{array}$$

with the additional constraint that $X \succeq 0$ in the positive semidefinite case. Here $\mathcal{A}(X)$ denotes a
linear map of the form (1.15) where the entries are sampled i.i.d. from a zero-mean unit variance
Gaussian distribution. In these experiments, the recovery regime is far larger than in the case
of sampling entries, but this is not particularly surprising as each Gaussian observation
measures a contribution from every entry in the matrix $M$. These Gaussian models were studied
extensively in [27].
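
To make the contrast with entry sampling concrete, here is a minimal sketch of such a Gaussian measurement map for the full-matrix case. It assumes the map acts as a dense $m \times n^2$ Gaussian matrix applied to the vectorized matrix, as described in the caption of Figure 3; the function name `gaussian_map` is hypothetical.

```python
import numpy as np

def gaussian_map(n, m, seed=0):
    """Return an m x n^2 standard Gaussian matrix A and the map X -> A vec(X)."""
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((m, n * n))

    def apply(X):
        # Each of the m measurements is an inner product with all n^2 entries of X.
        return A @ X.reshape(-1)

    return A, apply

n, r, m = 10, 2, 60
rng = np.random.default_rng(1)
M = rng.standard_normal((n, r)) @ rng.standard_normal((r, n))
A, apply_A = gaussian_map(n, m)
b = apply_A(M)
print(b.shape, np.count_nonzero(A[0]))
```

Every row of `A` has $n^2$ nonzero coefficients, whereas an entry-sampling measurement involves a single entry of $M$.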
Figure 2: Recovery of positive semidefinite matrices from their entries. For each
$(n, m, r)$ triple, we repeated the following procedure 50 times. A positive semidefinite matrix
$M$ of rank $r$ and a set of $m$ entries were selected at random. Then we solved the nuclear norm
minimization subject to $X_{ij} = M_{ij}$ on the selected entries with the constraint that $X \succeq 0$.
The color scheme for each cell denotes empirical recovery probability and is the same as in
Figure 1. The results are shown for (a) $n = 40$ and (b) $n = 50$.




Figure 3: Recovery of matrices from Gaussian observations. For each $(n, m, r)$ triple,
we repeated the following procedure 10 times. In (a), a matrix of rank $r$ was generated as in
Figure 1. In (b), a positive semidefinite matrix of rank $r$ was generated as in Figure 2. In
both plots, we select a matrix $A$ from the Gaussian ensemble with $m$ rows and $n^2$ (in (a)) or
$D_n = n(n+1)/2$ (in (b)) columns. Then we solve the nuclear norm minimization subject to
$\mathcal{A}(X) = \mathcal{A}(M)$. The color scheme for each cell denotes empirical recovery probability and is
the same as in Figures 1 and 2.
8 Discussion



8.1 Improvements 

In this paper, we have shown that under suitable conditions, one can reconstruct an n x n matrix 
of rank r from a small number of its sampled entries provided that this number is on the order 
of $n^{1.2} r \log n$, at least for moderate values of the rank. One would like to know whether better
results hold in the sense that exact matrix recovery would be guaranteed with a reduced number
of measurements. In particular, recall that an $n \times n$ matrix of rank $r$ depends on $(2n - r)r$ degrees
of freedom; is it true then that it is possible to recover most low-rank matrices from on the order
of $nr$ (up to logarithmic multiplicative factors) randomly selected entries? Can the sample size
be merely proportional to the true complexity of the low-rank object we wish to recover? 

In this direction, we would like to emphasize that there is nothing in our approach that apparently
prevents us from getting stronger results. Indeed, we developed a bound on the spectral norm
of each of the first four terms $(\mathcal{P}_{T^\perp}\mathcal{P}_\Omega\mathcal{P}_T)\mathcal{H}^k(E)$ in the series (4.13) (corresponding to values of $k$
equal to 0, 1, 2, 3) and used a general argument to bound the remainder of the series. Presumably,
one could bound higher order terms by the same techniques. Getting an appropriate bound on
$\|(\mathcal{P}_{T^\perp}\mathcal{P}_\Omega\mathcal{P}_T)\mathcal{H}^4(E)\|$ would lower the exponent of $n$ from 6/5 to 7/6. The appropriate bound on
$\|(\mathcal{P}_{T^\perp}\mathcal{P}_\Omega\mathcal{P}_T)\mathcal{H}^5(E)\|$ would further lower the exponent to 8/7, and so on. To obtain an optimal
result, one would need to reach $k$ of size about $\log n$. In doing so, however, one would have to
pay special attention to the size of the decoupling constants (the constant $C_D$ for two variables in
Lemma 6.5), which depend on $k$, the number of decoupled variables. These constants grow with $k$
and upper bounds are known [15, 16].
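
The exponents 6/5, 7/6, 8/7 can be checked with a short calculation. The sketch below assumes that the remainder of the series starting at index $k_0$ is controlled by a bound of the form $(n^2 r/m)^{1/2}(\mu_0 n r \beta \log n/m)^{k_0/2}$, as in Lemma 4.8, and tracks only the power of $n$: writing $m \sim n^{1+a}$, the bound scales like $n^{(1-a)/2 - a k_0/2}$, which stays bounded once $a \ge 1/(k_0+1)$. The function name is hypothetical.

```python
from fractions import Fraction

def sample_exponent(k0):
    """Smallest exponent 1 + a of n for which n^{(1-a)/2} * n^{-a*k0/2} stays bounded."""
    a = Fraction(1, k0 + 1)
    # Sanity check: the power of n in the remainder bound vanishes at this value of a.
    assert (1 - a) / 2 - a * Fraction(k0, 2) == 0
    return 1 + a

# k0 = 4: terms k = 0..3 bounded individually; k0 = 5 and 6: one or two more terms.
for k0 in (4, 5, 6):
    print(k0, sample_exponent(k0))
```

As $k_0$ grows the exponent $1 + 1/(k_0+1)$ tends to 1, which is consistent with the remark that $k$ of size about $\log n$ would be needed for an optimal result.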



8.2 Further directions 



It would be of interest to extend our results to the case where the unknown matrix is approximately 
low-rank. Suppose we write the SVD of a matrix M as 

$$M = \sum_{1 \le k \le n} \sigma_k u_k v_k^*,$$

where $\sigma_1 \ge \sigma_2 \ge \ldots \ge \sigma_n \ge 0$ and assume for simplicity that none of the $\sigma_k$'s vanish. In general, it
is impossible to complete such a matrix exactly from a partial subset of its entries. However, one
might hope to be able to recover a good approximation if, for example, most of the singular values
are small or negligible. For instance, consider the truncated SVD of the matrix $M$,

$$M_r = \sum_{1 \le k \le r} \sigma_k u_k v_k^*,$$

where the sum extends over the $r$ largest singular values and let $M_\star$ be the solution to (1.5). Then
one would not expect to have $M_\star = M$ but it would be of great interest to determine whether the
size of $M_\star - M$ is comparable to that of $M - M_r$ provided that the number of sampled entries
is sufficiently large. For example, one would like to know whether it is reasonable to expect that
$\|M_\star - M\|_*$ is on the same order as $\|M - M_r\|_*$ (one could ask for a similar comparison with a
different norm). If the answer is positive, then this would say that approximately low-rank matrices 
can be accurately recovered from a small set of sampled entries. 
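
The benchmark quantity in this question, the nuclear norm of $M - M_r$, is simply the sum of the discarded singular values. A minimal sketch, assuming a synthetic approximately low-rank matrix (a rank-$r$ matrix plus a small dense perturbation, chosen here only for illustration):

```python
import numpy as np

def truncated_svd(M, r):
    """Return the best rank-r approximation M_r and all singular values of M."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    M_r = (U[:, :r] * s[:r]) @ Vt[:r, :]
    return M_r, s

rng = np.random.default_rng(0)
n, r = 30, 3
low_rank = rng.standard_normal((n, r)) @ rng.standard_normal((r, n))
M = low_rank + 0.01 * rng.standard_normal((n, n))  # approximately low-rank

M_r, s = truncated_svd(M, r)
# Nuclear norm of the residual equals the sum of the n - r smallest singular values.
tail_nuclear = np.linalg.svd(M - M_r, compute_uv=False).sum()
print(tail_nuclear, s[r:].sum())
```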
Another important direction is to determine whether the reconstruction is robust to noise as in
some applications, one would presumably observe

$$Y_{ij} = M_{ij} + z_{ij}, \quad (i,j) \in \Omega,$$

where $z$ is a deterministic or stochastic perturbation. In this setup, one would perhaps want to
minimize the nuclear norm subject to $\|\mathcal{P}_\Omega(X - Y)\|_F \le \epsilon$, where $\epsilon$ is an upper bound on the
noise level, instead of enforcing the equality constraint $\mathcal{P}_\Omega(X) = \mathcal{P}_\Omega(Y)$. Can one expect that this
algorithm or a variation thereof provides accurate answers? That is, can one expect that the error 
between the recovered and the true data matrix be proportional to the noise level? 
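
The relaxed constraint is easy to state in code. The sketch below only illustrates the sampling operator and the feasibility test, under the assumption that the observations are $M + z$ on the sampled set; the names `P_Omega` and `is_feasible` are hypothetical, and no solver is included.

```python
import numpy as np

def P_Omega(X, mask):
    """Keep the sampled entries of X and set all others to zero."""
    return np.where(mask, X, 0.0)

def is_feasible(X, Y, mask, eps):
    """Check the relaxed constraint ||P_Omega(X - Y)||_F <= eps."""
    return np.linalg.norm(P_Omega(X - Y, mask), "fro") <= eps

rng = np.random.default_rng(0)
n, r = 20, 2
M = rng.standard_normal((n, r)) @ rng.standard_normal((r, n))
mask = rng.random((n, n)) < 0.5           # sampled set Omega
z = 0.1 * rng.standard_normal((n, n))     # perturbation
Y = P_Omega(M + z, mask)                  # noisy observations on Omega

noise_level = np.linalg.norm(P_Omega(z, mask), "fro")
# The true matrix is feasible as soon as eps is an upper bound on the noise level.
print(is_feasible(M, Y, mask, 1.01 * noise_level))
print(is_feasible(M, Y, mask, 0.5 * noise_level))
```

With $\epsilon$ at least the noise level, the true matrix $M$ remains feasible, which is the minimal requirement for the relaxed program to make sense.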

9 Appendix 

9.1 Proof of Theorem 4.2



The proof of (4.10) follows that in [10] but we shall use slightly more precise estimates. 

Let $Y_1, \ldots, Y_n$ be a sequence of independent random variables taking values in a Banach space
and let $Y_*$ be the supremum defined as

$$Y_* = \sup_{f \in \mathcal{F}} \sum_{i=1}^n f(Y_i), \qquad (9.1)$$

where $\mathcal{F}$ is a countable family of real-valued functions such that if $f \in \mathcal{F}$, then $-f \in \mathcal{F}$. Talagrand
[33] proved a concentration inequality about $Y_*$, see also [22, Corollary 7.8].

Theorem 9.1 Assume that $|f| \le B$ and $\mathbb{E} f(Y_i) = 0$ for every $f$ in $\mathcal{F}$ and $i = 1, \ldots, n$. Then for
all $t \ge 0$,

$$\mathbb{P}(|Y_* - \mathbb{E} Y_*| > t) \le 3\exp\left(-\frac{t}{KB}\log\left(1 + \frac{Bt}{\sigma^2 + B\,\mathbb{E} Y_*}\right)\right), \qquad (9.2)$$

where $\sigma^2 = \sup_{f \in \mathcal{F}} \sum_{i=1}^n \mathbb{E} f^2(Y_i)$, and $K$ is a numerical constant.

We note that very precise values of the numerical constant K are known and are small, see [20] . 

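We will apply this theorem to the random variable $Z$ defined in the statement of Theorem 4.2.
Put $\mathcal{Y}_{ab} = p^{-1}(\delta_{ab} - p)\,\mathcal{P}_T(e_a e_b^*) \otimes \mathcal{P}_T(e_a e_b^*)$ and $\mathcal{Y} = \sum_{ab} \mathcal{Y}_{ab}$. By definition,

$$Z = \sup\,\langle X_1, \mathcal{Y}(X_2)\rangle = \sup \sum_{ab} \langle X_1, \mathcal{Y}_{ab}(X_2)\rangle = \sup\, p^{-1}\sum_{ab}(\delta_{ab} - p)\,\langle X_1, \mathcal{P}_T(e_a e_b^*)\rangle\,\langle \mathcal{P}_T(e_a e_b^*), X_2\rangle,$$

where the supremum is over a countable collection of matrices $X_1$ and $X_2$ obeying $\|X_1\|_F \le 1$ and
$\|X_2\|_F \le 1$. Note that it follows from (4.8) that

$$|\langle X_1, \mathcal{Y}_{ab}(X_2)\rangle| = p^{-1}|\delta_{ab} - p|\,|\langle X_1, \mathcal{P}_T(e_a e_b^*)\rangle|\,|\langle \mathcal{P}_T(e_a e_b^*), X_2\rangle| \le p^{-1}\|\mathcal{P}_T(e_a e_b^*)\|_F^2 \le 2\mu_0 r/(\min(n_1, n_2)\,p) = 2\mu_0 nr/m$$

(recall that $n = \max(n_1, n_2)$). Hence, we can apply Theorem 9.1 with $B = 2\mu_0(nr/m)$. Also

$$\mathbb{E}|\langle X_1, \mathcal{Y}_{ab}(X_2)\rangle|^2 = p^{-1}(1-p)\,|\langle X_1, \mathcal{P}_T(e_a e_b^*)\rangle|^2\,|\langle X_2, \mathcal{P}_T(e_a e_b^*)\rangle|^2 \le p^{-1}\|\mathcal{P}_T(e_a e_b^*)\|_F^2\,|\langle \mathcal{P}_T(X_2), e_a e_b^*\rangle|^2$$

so that

$$\sum_{ab}\mathbb{E}|\langle X_1, \mathcal{Y}_{ab}(X_2)\rangle|^2 \le (2\mu_0 nr/m)\sum_{ab}|\langle \mathcal{P}_T(X_2), e_a e_b^*\rangle|^2 = (2\mu_0 nr/m)\,\|\mathcal{P}_T(X_2)\|_F^2 \le 2\mu_0 nr/m.$$

Since $\mathbb{E} Z \le 1$, Theorem 9.1 gives

$$\mathbb{P}(|Z - \mathbb{E} Z| > t) \le 3\exp\left(-\frac{t}{KB}\log(1 + t/2)\right) \le 3\exp\left(-\frac{t \log 2}{KB}\min(1, t/2)\right),$$

where we have used the fact that $\log(1+u) \ge (\log 2)\min(1, u)$ for $u \ge 0$. Plugging $t = \lambda\sqrt{\mu_0 nr \log n / m}$
and $B = 2\mu_0 nr/m$ establishes the claim.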
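
The elementary inequality $\log(1+u) \ge (\log 2)\min(1, u)$ used in the last step can be checked numerically. The sketch below evaluates the gap on a grid and compares the two successive tail bounds for a few values of $t$; the function names are hypothetical, and the values of $B$ and $K$ are arbitrary placeholders since $K$ is an unspecified numerical constant.

```python
import numpy as np

def log_lower_bound_gap(u):
    """Return log(1 + u) - (log 2) * min(1, u); the claim is that this is nonnegative."""
    u = np.asarray(u, dtype=float)
    return np.log1p(u) - np.log(2.0) * np.minimum(1.0, u)

def tail_bounds(t, B, K):
    """The two successive upper bounds on P(|Z - EZ| > t) from the display above."""
    first = 3.0 * np.exp(-(t / (K * B)) * np.log1p(t / 2.0))
    second = 3.0 * np.exp(-(t * np.log(2.0) / (K * B)) * min(1.0, t / 2.0))
    return first, second

grid = np.linspace(0.0, 50.0, 200001)
print(log_lower_bound_gap(grid).min())
print(tail_bounds(t=0.5, B=0.1, K=1.0))
```

The gap vanishes at $u = 0$ and $u = 1$ and is positive elsewhere, by concavity of the logarithm on $[0, 1]$ and monotonicity beyond.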

9.2 Proof of Lemma 6.2

We shall make use of the following lemma which is an application of well-known deviation bounds 
about binomial variables. 

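Lemma 9.2 Let $\{\delta_i\}_{1 \le i \le n}$ be a sequence of i.i.d. Bernoulli variables with $\mathbb{P}(\delta_i = 1) = p$ and
$Y = \sum_{i=1}^n \delta_i$. Then for each $\lambda > 0$,

$$\mathbb{P}(Y > \lambda\,\mathbb{E} Y) \le \exp\left(-\frac{\lambda^2}{2 + 2\lambda/3}\,\mathbb{E} Y\right). \qquad (9.3)$$

The random variable $\sum_b \delta_{ab} E_{ab}^2$ is bounded by $\|E\|_\infty^2 \sum_b \delta_{ab}$ and it thus suffices to estimate the
$q$th moment of $Y_* = \max_a Y_a$ where $Y_a = \sum_b \delta_{ab}$. The inequality (9.3) implies that

$$\mathbb{P}(Y_* > \lambda np) \le n\exp\left(-\frac{\lambda^2}{2 + 2\lambda/3}\,np\right),$$

and for $\lambda \ge 2$, this gives $\mathbb{P}(Y_* > \lambda np) \le n e^{-\lambda np/2}$. Hence

$$\mathbb{E} Y_*^q = \int_0^\infty \mathbb{P}(Y_* > t)\,q t^{q-1}\,dt \le (2np)^q + n\int_{2np}^\infty e^{-t/2}\,q t^{q-1}\,dt.$$

By integrating by parts, one can check that when $q \le np$, we have

$$\int_{2np}^\infty n e^{-t/2}\,q t^{q-1}\,dt \le nq\,(2np)^q e^{-np}.$$

Under the assumptions of the lemma, we have $nq e^{-np} \le 1$ and, therefore,

$$\mathbb{E} Y_*^q \le 2(2np)^q.$$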

The conclusion follows. 
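
As a sanity check, the moment bound can be compared with a Monte Carlo estimate. This sketch assumes the two conditions used in the proof, $q \le np$ and $nq e^{-np} \le 1$, and uses the fact that the row sums $Y_a$ are independent Binomial$(n, p)$ variables; the function name and the parameter values are illustrative only.

```python
import numpy as np

def max_row_count_moment(n, p, q, trials=2000, seed=0):
    """Monte Carlo estimate of E[Y_*^q], where Y_* is the largest of n Binomial(n, p) row sums."""
    rng = np.random.default_rng(seed)
    # One row of Y per trial: n independent row sums Y_a ~ Binomial(n, p).
    Y = rng.binomial(n, p, size=(trials, n))
    Y_star = Y.max(axis=1).astype(float)
    return np.mean(Y_star ** q)

n, p, q = 200, 0.2, 3
estimate = max_row_count_moment(n, p, q)
bound = 2 * (2 * n * p) ** q
print(estimate, bound)
```

For these parameters the estimate sits well below $2(2np)^q$, which is expected since the bound is not meant to be tight.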